What are the responsibilities and job description for the Site Reliability Engineering Manager position at Ziga.AI?
Company Description
Ziga AI is on a mission to revolutionize Site Reliability Engineering by deploying intelligent AI agents that automate chaotic oncall processes, enhance observability, and resolve incidents proactively. This empowers SRE teams to focus on innovation rather than firefighting, making high-velocity startups and scaling tech companies more resilient and efficient. Our vision is to leverage agentic AI (autonomous systems that learn, adapt, and act) to eliminate the pain points in oncall engineering, from alert fatigue to root cause analysis.
Role Description
We’re hiring a former SRE or Senior Oncall Engineer to spearhead our operations. As an early team member, you’ll collaborate closely with our founding team to map out the SRE landscape in high-velocity startups, and dive deeper into oncall bottlenecks. This role is a gateway to transition from hands-on operations into AI product strategy, using your expertise to solve the very issues that kept you up at night.
What You’ll Do
- Serve as our domain expert on oncall engineering in fast-paced, high-velocity environments
- Leverage relationships with SREs, DevOps leads, and senior engineers at startups and scale-ups to identify problems suited to agentic AI solutions and to uncover pain points in oncall processes, observability management (e.g., monitoring stacks, alerting, dashboards), and incident response workflows
- Extract and document underlying problems, such as alert overload, siloed tools, manual triage, and scalability issues in distributed systems
- Contribute to the product roadmap by validating AI agent features that address these challenges, and advise on go-to-market strategies targeting SRE communities
Ideal Background
- 5 years of experience in SRE, DevOps, or oncall engineering roles, preferably in high-velocity startups or tech companies dealing with complex, distributed systems
- Deep familiarity with oncall rotations, observability tools (e.g., Prometheus, Grafana, ELK stack, Datadog), incident management (e.g., PagerDuty, OpsGenie), and common SRE workflows like SLOs, error budgets, and post-mortems
- Entrepreneurial mindset, thrilled by the prospect of shaping AI agents from the ground up based on real practitioner insights
- Based in the US (preferably in tech hubs like SF Bay Area, Seattle, Austin, or NYC)
- Strong network in the SRE/DevOps community
Nice to Have
- Experience with AI/ML tools in operations, or exposure to agentic systems (e.g., autonomous agents in monitoring or automation)
- Previous consulting, advisory, or community-building roles in SRE forums (e.g., contributions to SREcon, Reddit communities, or open-source observability projects)
- Startup experience or comfort in ambiguous, multi-hat environments where you drive lead generation and insight gathering
Why Join Us?
- Join a lean founding team at the forefront of agentic AI innovation for SRE
- Directly influence product development with your battle-tested oncall expertise
- Work remotely with flexible hours to suit your lifestyle