What are the responsibilities and job description for the Site Reliability Engineering Manager position at Calance?

The Role

You will build and lead the Site Reliability Engineering team, owning the infrastructure that development, validation, and customer-facing deployments run on. This spans colocation facilities, on-premises lab clusters, cloud environments (AWS, Azure, GCP), and the platform services customers use to collaborate on hardware and software deployments.

You are both a people manager and a practicing engineer. You will set technical direction, hire and grow the team, own SLOs for critical systems, and be the senior escalation point when things go wrong. You will work closely with hardware and software development teams to ensure HPC infrastructure meets their workload requirements — and partner with the Senior DevOps Lead whose pipelines and automation run on the infrastructure you own.

What You Will Do

Team Leadership & Strategy
Develop and manage a team of 3–5 site reliability engineers; establish a culture of operational excellence, ownership, and continuous improvement.
Define the SRE team's technical roadmap: reliability architecture, automation priorities, capacity planning, and on-call model.
Serve as the senior technical escalation for critical incidents — guiding cross-team triage, driving RCA, and ensuring systemic fixes rather than point patches.
Translate operational signals and infrastructure health into clear, actionable narratives for engineering leadership and executive stakeholders.
Partner with hardware and software development teams to understand HPC workload requirements and ensure infrastructure capacity, performance, and reliability meet the needs of silicon and software development programs.

24×7 Infrastructure Reliability & Observability
Own 24×7 reliability across colocation, on-premises lab clusters, cloud, and customer-facing platform services — designing for failure domains, progressive delivery, and strict change control at every tier.
Own the full observability stack (metrics, traces, logs) and define SLOs/SLIs across all SRE systems; use AI-driven detection, correlation, and guided remediation to reduce time to detect, respond, and resolve.
Evolve incident and problem management into a data-driven discipline: automated triage workflows, AI/analytics to identify recurring patterns, and every P0/P1 producing a written RCA with tracked systemic fixes.
Lead FinOps and capacity planning: model TCO across cloud vs. on-prem vs. colo, drive workload placement decisions, and anticipate infrastructure needs for new silicon programs and customer deployments.
Own infrastructure for customer collaboration environments where partners deploy and validate hardware and software.

Automation & Infrastructure as Code
Drive IaC-first discipline across the team — Terraform, Ansible, and production-quality automation for all infrastructure provisioning and lifecycle management.
Build and mature self-healing infrastructure platforms: host lifecycle automation, fleet auto-remediation, and AIOps-driven alerting that reduce manual intervention across the operational lifecycle.

Documentation & Global Collaboration
Build a documentation culture and scale a follow-the-sun on-call model as we expand globally, with runbooks, architecture diagrams, and operational playbooks maintained as living artifacts.
Drive POC and POV evaluations for new infrastructure technologies, interconnect fabrics, and platform services relevant to our accelerator roadmap.

Required

Bachelor's or Master's in Computer Science, Electrical Engineering, or a related field; 12 years of experience in SRE, infrastructure engineering, or production engineering preferred (8 years minimum).
3 years managing SRE or infrastructure teams — hiring, growing, and retaining engineers in a fast-moving environment.
Deep Linux systems expertise: networking (TCP/IP, RDMA, bonding), storage, kernel tuning, and bare-metal operations.
Proven experience operating colocation and on-premises hardware at scale: server lifecycle, power and cooling awareness, rack-level networking.
IaC fluency: Terraform and Ansible at production scale — module design, remote state, environment isolation, and change governance.
Kubernetes cluster operations: lifecycle management, workload reliability, storage, and RBAC at scale.
Full observability stack ownership: Prometheus, Grafana, and/or Datadog, covering SLO definition, alert design, and end-to-end signal quality.
Strong Python and/or Go — production services, not just scripts; automation that touches real infrastructure safely.
Track record of reducing MTTR/MTTD through automation, workflow orchestration, and AIOps tooling.
Executive communication: translating infrastructure health and operational risk into clear narratives for senior leadership.
Demonstrated track record of moving teams from reactive, process-heavy operations to automated, technology-focused models — not just managing existing runbooks.

Strongly Preferred

Experience operating customer-facing infrastructure or platform services — reliability expectations beyond internal tooling.
Knowledge of high-speed interconnect fabrics: InfiniBand, RoCE, or NVLink — setup, troubleshooting, and performance tuning.
HPC job scheduler experience: Slurm, LSF, or equivalent — setup, tuning, and integration with infrastructure automation.
Multi-cloud hybrid operations: AWS, Azure, GCP alongside on-prem/colo — unified observability and IaC across all tiers.
FinOps: cloud spend attribution, TCO modeling across cloud vs. on-prem vs. colo, and translating cost data into workload placement recommendations for engineering and executive audiences.
ITIL knowledge or equivalent structured incident/problem/change management framework experience.
Published technical writing, conference talks, or open-source contributions in reliability, observability, or HPC infrastructure.
