What are the responsibilities and job description for the Site Reliability Engineering Manager position at Ziga.AI?

Company Description

Ziga AI is on a mission to revolutionize Site Reliability Engineering by deploying intelligent AI agents that automate chaotic oncall processes, enhance observability, and resolve incidents proactively. This empowers SRE teams to focus on innovation rather than firefighting, making high-velocity startups and scaling tech companies more resilient and efficient. Our vision is to leverage agentic AI (autonomous systems that learn, adapt, and act) to eliminate the pain points in oncall engineering, from alert fatigue to root cause analysis.

Role Description

We’re hiring a former SRE or Senior Oncall Engineer to spearhead our operations. As an early team member, you’ll collaborate closely with our founding team to map out the SRE landscape in high-velocity startups and dig into oncall bottlenecks. This role is a gateway from hands-on operations into AI product strategy, using your expertise to solve the very issues that kept you up at night.

What You’ll Do

Serve as our domain expert on oncall engineering in fast-paced, high-velocity environments
Leverage relationships with SREs, DevOps leads, and senior engineers at startups and scale-ups to understand which of their problems agentic AI could solve, and to uncover pain points in oncall processes, observability management (e.g., monitoring stacks, alerting, dashboards), and incident response workflows
Extract and document underlying problems, such as alert overload, siloed tools, manual triage, and scalability issues in distributed systems
Contribute to the product roadmap by validating AI agent features that address these challenges, and advise on go-to-market strategies targeting SRE communities

Ideal Background

5 years of experience in SRE, DevOps, or oncall engineering roles, preferably in high-velocity startups or tech companies dealing with complex, distributed systems
Deep familiarity with oncall rotations, observability tools (e.g., Prometheus, Grafana, ELK stack, Datadog), incident management (e.g., PagerDuty, OpsGenie), and common SRE workflows like SLOs, error budgets, and post-mortems
Entrepreneurial mindset, thrilled by the prospect of shaping AI agents from the ground up based on real practitioner insights
Based in the US (preferably in tech hubs like SF Bay Area, Seattle, Austin, or NYC)
Strong network in the SRE/DevOps community

Nice to Have

Experience with AI/ML tools in operations, or exposure to agentic systems (e.g., autonomous agents in monitoring or automation)
Previous consulting, advisory, or community-building roles in SRE forums (e.g., contributions to SREcon, Reddit communities, or open-source observability projects)
Startup experience or comfort in ambiguous, multi-hat environments where you drive lead generation and insight gathering

Why Join Us?

Join a lean founding team at the forefront of AI-agentic SRE innovation
Directly influence product development with your battle-tested oncall expertise
Work remotely with flexible hours to suit your lifestyle
