
Site Reliability Engineer Manager - Hybrid

Calance
Santa Clara, CA Remote Contractor
POSTED ON 6/17/2026
AVAILABLE BEFORE 7/17/2026
The Role
You will build and lead the Site Reliability Engineering team, owning the infrastructure that development, validation, and customer-facing deployments run on. This spans colocation facilities, on-premises lab clusters, cloud environments (AWS, Azure, Google Cloud Platform), and the platform services customers use to collaborate on hardware and software deployments.

You are both a people manager and a practicing engineer. You will set technical direction, hire and grow the team, own SLOs for critical systems, and be the senior escalation point when things go wrong. You will work closely with hardware and software development teams to ensure HPC infrastructure meets their workload requirements, and you will partner with the Senior DevOps Lead, whose pipelines and automation run on the infrastructure you own.

What You Will Do
Team Leadership & Strategy
Develop and manage a team of 3 to 5 SRE engineers; establish a culture of operational excellence, ownership, and continuous improvement.
Define the SRE team's technical roadmap: reliability architecture, automation priorities, capacity planning, and on-call model.
Serve as the senior technical escalation point for critical incidents: guiding cross-team triage, driving RCA, and ensuring systemic fixes rather than point patches.
Translate operational signals and infrastructure health into clear, actionable narratives for engineering leadership and executive stakeholders.
Partner with hardware and software development teams to understand HPC workload requirements and ensure infrastructure capacity, performance, and reliability meet the needs of silicon and software development programs.

24x7 Infrastructure Reliability & Observability
Own 24/7 reliability across colocation, on-premises lab clusters, cloud, and customer-facing platform services, designing for failure domains, progressive delivery, and strict change control at every tier.
Own the full observability stack (metrics, traces, logs) and define SLOs/SLIs across all SRE systems; use AI-driven detection, correlation, and guided remediation to reduce time to detect, respond, and resolve.
Evolve incident and problem management into a data-driven discipline: automated triage workflows, AI/analytics to identify recurring patterns, and every P0/P1 producing a written RCA with tracked systemic fixes.
Lead FinOps and capacity planning: model TCO across cloud vs. on-prem vs. colo, drive workload placement decisions, and anticipate infrastructure needs for new silicon programs and customer deployments.
Own infrastructure for customer collaboration environments where partners deploy and validate hardware and software.

Automation & Infrastructure as Code
Drive IaC-first discipline across the team: Terraform, Ansible, and production-quality automation for all infrastructure provisioning and lifecycle management.
Build and mature self-healing infrastructure platforms: host lifecycle automation, fleet auto-remediation, and AIOps-driven alerting that reduce manual intervention across the operational lifecycle.

Documentation & Global Collaboration
Build a documentation culture and scale a follow-the-sun on-call model as we expand globally: runbooks, architecture diagrams, and operational playbooks maintained as living artifacts.
Drive POC and POV evaluations for new infrastructure technologies, interconnect fabrics, and platform services relevant to our accelerator roadmap.

What You Will Bring
Required
Bachelor's or Master's in Computer Science, Electrical Engineering, or related field; 12 years in SRE, infrastructure engineering, or production engineering (8 years minimum).
3 years managing SRE or infrastructure teams: hiring, growing, and retaining engineers in a fast-moving environment.
Deep Linux systems expertise: networking (TCP/IP, RDMA, bonding), storage, kernel tuning, and bare-metal operations.
Proven experience operating colocation and on-premises hardware at scale: server lifecycle, power and cooling awareness, rack-level networking.
IaC fluency with Terraform and Ansible at production scale: module design, remote state, environment isolation, and change governance.
Kubernetes cluster operations: lifecycle management, workload reliability, storage, and RBAC at scale.
Full observability stack ownership (Prometheus, Grafana, and/or Datadog): SLO definition, alert design, and E2E signal quality.
Strong Python and/or Go: production services, not just scripts; automation that touches real infrastructure safely.
Track record of reducing MTTR/MTTD through automation, workflow orchestration, and AIOps tooling.
Executive communication: translating infrastructure health and operational risk into clear narratives for senior leadership.
Demonstrated track record of moving teams from reactive, process-heavy operations to automated, technology-focused models, not just managing existing runbooks.

Strongly Preferred
Experience operating customer-facing infrastructure or platform services, with reliability expectations beyond internal tooling.
Knowledge of high-speed interconnect fabrics (InfiniBand, RoCE, or NVLink): setup, troubleshooting, and performance tuning.
HPC job scheduler experience (Slurm, LSF, or equivalent): setup, tuning, and integration with infrastructure automation.
Multi-cloud hybrid operations (AWS, Azure, Google Cloud Platform alongside on-prem/colo): unified observability and IaC across all tiers.
FinOps: cloud spend attribution, TCO modeling across cloud vs. on-prem vs. colo, and translating cost data into workload placement recommendations for engineering and executive audiences.
ITIL knowledge or equivalent structured incident/problem/change management framework experience.
Published technical writing, conference talks, or open-source contributions in reliability, observability, or HPC infrastructure.

Salary: $90 - $120




