Site Reliability Engineering Manager

Calance
Santa Clara, CA Full Time
POSTED ON 6/22/2026 CLOSED ON 6/25/2026

What are the responsibilities and job description for the Site Reliability Engineering Manager position at Calance?

The Role

You will build and lead the Site Reliability Engineering team, owning the infrastructure that development, validation, and customer-facing deployments run on. This spans colocation facilities, on-premises lab clusters, cloud environments (AWS, Azure, GCP), and the platform services customers use to collaborate on hardware and software deployments.


You are both a people manager and a practicing engineer. You will set technical direction, hire and grow the team, own SLOs for critical systems, and be the senior escalation point when things go wrong. You will work closely with hardware and software development teams to ensure HPC infrastructure meets their workload requirements — and partner with the Senior DevOps Lead whose pipelines and automation run on the infrastructure you own.


What You Will Do

  • Team Leadership & Strategy
  • Develop and manage a team of 3–5 SRE engineers; establish a culture of operational excellence, ownership, and continuous improvement.
  • Define the SRE team's technical roadmap: reliability architecture, automation priorities, capacity planning, and on-call model.
  • Serve as the senior technical escalation for critical incidents — guiding cross-team triage, driving RCA, and ensuring systemic fixes rather than point patches.
  • Translate operational signals and infrastructure health into clear, actionable narratives for engineering leadership and executive stakeholders.
  • Partner with hardware and software development teams to understand HPC workload requirements and ensure infrastructure capacity, performance, and reliability meet the needs of silicon and software development programs.


  • 24×7 Infrastructure Reliability & Observability
  • Own 24×7 reliability across colocation, on-premises lab clusters, cloud, and customer-facing platform services — designing for failure domains, progressive delivery, and strict change control at every tier.
  • Own the full observability stack (metrics, traces, logs) and define SLOs/SLIs across all SRE systems; use AI-driven detection, correlation, and guided remediation to reduce time to detect, respond, and resolve.
  • Evolve incident and problem management into a data-driven discipline: automated triage workflows, AI/analytics to identify recurring patterns, and every P0/P1 producing a written RCA with tracked systemic fixes.
  • Lead FinOps and capacity planning: model TCO across cloud vs. on-prem vs. colo, drive workload placement decisions, and anticipate infrastructure needs for new silicon programs and customer deployments.
  • Own infrastructure for customer collaboration environments where partners deploy and validate hardware and software.
  • Automation & Infrastructure as Code
  • Drive IaC-first discipline across the team — Terraform, Ansible, and production-quality automation for all infrastructure provisioning and lifecycle management.
  • Build and mature self-healing infrastructure platforms: host lifecycle automation, fleet auto-remediation, and AIOps-driven alerting that reduce manual intervention across the operational lifecycle.
  • Documentation & Global Collaboration
  • Build a documentation culture, with runbooks, architecture diagrams, and operational playbooks maintained as living artifacts, and scale a follow-the-sun on-call model as we expand globally.
  • Drive POC and POV evaluations for new infrastructure technologies, interconnect fabrics, and platform services relevant to our accelerator roadmap.


Required

  • Bachelor's or Master's in Computer Science, Electrical Engineering, or a related field; 12 years of experience in SRE, infrastructure engineering, or production engineering preferred (8 years minimum).
  • 3 years managing SRE or infrastructure teams — hiring, growing, and retaining engineers in a fast-moving environment.
  • Deep Linux systems expertise: networking (TCP/IP, RDMA, bonding), storage, kernel tuning, and bare-metal operations.
  • Proven experience operating colocation and on-premises hardware at scale: server lifecycle, power and cooling awareness, rack-level networking.
  • IaC fluency: Terraform and Ansible at production scale — module design, remote state, environment isolation, and change governance.
  • Kubernetes cluster operations: lifecycle management, workload reliability, storage, and RBAC at scale.
  • Full observability stack ownership: Prometheus, Grafana, and/or DataDog — SLO definition, alert design, and E2E signal quality.
  • Strong Python and/or Go — production services, not just scripts; automation that touches real infrastructure safely.
  • Track record of reducing MTTR/MTTD through automation, workflow orchestration, and AIOps tooling.
  • Executive communication: translating infrastructure health and operational risk into clear narratives for senior leadership.
  • Demonstrated track record of moving teams from reactive, process-heavy operations to automated, technology-focused models — not just managing existing runbooks.


Strongly Preferred

  • Experience operating customer-facing infrastructure or platform services — reliability expectations beyond internal tooling.
  • Knowledge of high-speed interconnect fabrics: InfiniBand, RoCE, or NVLink — setup, troubleshooting, and performance tuning.
  • HPC job scheduler experience: Slurm, LSF, or equivalent — setup, tuning, and integration with infrastructure automation.
  • Multi-cloud hybrid operations: AWS, Azure, GCP alongside on-prem/colo — unified observability and IaC across all tiers.
  • FinOps: cloud spend attribution, TCO modeling across cloud vs. on-prem vs. colo, and translating cost data into workload placement recommendations for engineering and executive audiences.
  • ITIL knowledge or equivalent structured incident/problem/change management framework experience.
  • Published technical writing, conference talks, or open-source contributions in reliability, observability, or HPC infrastructure.


Salary.com Estimation for Site Reliability Engineering Manager in Santa Clara, CA
$194,254 to $233,857

This job has expired.
