Site Reliability Engineer (AI Acceleration)- Hybrid

Calance US
Santa Clara, CA Other
POSTED ON 6/18/2026
AVAILABLE BEFORE 7/16/2026
We are hiring a Site Reliability Engineer (AI Acceleration)- Hybrid for a contract-to-hire position in Santa Clara, CA.

The Role
You will be a core member of the SRE team, responsible for the reliability, automation, and observability of the infrastructure that the company runs on. You will work across colocation, on-premises lab environments, and cloud platforms, and you will own your systems end-to-end, from initial provisioning through live incident response.
You will partner with hardware and software development teams to support their workload needs, including the associated CI/CD pipelines and automation layer for software tooling. You will also support customer-facing environments where d-Matrix partners collaborate on hardware and software deployments.

What You Will Do

Infrastructure Operations
Own reliability and availability of assigned infrastructure domains: colo server fleets, on-premises lab clusters, cloud environments (AWS, Azure, GCP), and customer-facing platform services.
Perform hands-on infrastructure work: server provisioning, OS configuration, network setup, storage management, and hardware troubleshooting from bare metal up.
Conduct capacity planning and hardware lifecycle management for assigned infrastructure domains; track and report cloud spend for your domains to support FinOps and workload placement decisions.

Automation & Infrastructure as Code
Own IaC and configuration management (Terraform, Ansible) for your infrastructure domains: all provisioning and changes go through code, not manual steps.
Build, deploy and document automation to eliminate toil: host lifecycle management, fleet health checks, auto-remediation workflows, and self-service tooling for engineering teams.
Develop networking automations for cluster interconnects, VLAN management, and lab network configurations.
Contribute to shared IaC modules and automation libraries used across the global SRE and data center services teams.

Observability & Incident Response
Design and maintain monitoring dashboards, alerting, and SLIs (Prometheus/Grafana, DataDog) for your infrastructure domains, ensuring signal quality and actionable alerts, and contribute to AIOps-driven detection workflows that reduce time to detect and respond.
Participate in on-call rotation; triage and resolve incidents from bare metal to application layer using structured, AI-assisted workflows, distinguishing infrastructure faults from software or hardware product issues and escalating with clear context.
Produce high-quality RCA reports for P0/P1 incidents with root cause analysis and tracked action items.
Detect performance issues, recommend solutions, and implement fixes that permanently improve system reliability.

Customer & Platform Development Services
Support and operate platform services used by both internal teams and external customers for hardware and software deployment collaboration with d-Matrix.
Ensure QoS and uptime commitments for customer-facing environments; escalate reliability risks proactively.
Document platform configurations, access procedures, and operational runbooks for customer environments.

Documentation & Collaboration
Maintain high-quality runbooks, architecture diagrams, and troubleshooting guides; documentation is part of the job, not an afterthought.
Partner with the DevOps team to ensure infrastructure reliability supports CI/CD pipeline performance and developer experience.
Serve as a technical resource for engineering teams, sharing operational knowledge and raising infrastructure risks early.

What You Will Bring

Required
Bachelor's or Master's in Computer Science, Electrical Engineering, or related field (or equivalent experience); 5 years in SRE, infrastructure engineering, or systems administration.
Strong Linux systems knowledge: networking, storage, systemd, package management, kernel parameters, and performance diagnostics.
Hands-on experience with colocation or on-premises server infrastructure: physical hardware, rack networking, and bare-metal provisioning.
Hands-on experience deploying and operating AI-driven infrastructure tools (AIOps platforms, intelligent alerting, anomaly detection, or LLM-assisted diagnostics) in production environments.
IaC experience with Terraform and/or Ansible: writing and maintaining production configurations, not just running existing playbooks.
Kubernetes operational experience: cluster troubleshooting, workload management, storage, and networking.
Prometheus/Grafana or DataDog: building dashboards, writing alert rules, and understanding signal quality.
Python and/or Bash scripting: production-quality automation, not just one-off scripts.
Incident response experience: structured triage, RCA production, and follow-through on action items.
Comfort operating in fast-moving startups: you own your systems, document what you build, and iterate without waiting for perfect requirements.

Strongly Preferred
Experience operating customer-facing infrastructure or platform services with external reliability expectations.
Cloud infrastructure operations across AWS, Azure, or GCP, including hybrid environments spanning cloud and on-prem.
HPC job scheduler experience (Slurm, LSF, or equivalent): operations and troubleshooting.
Knowledge of high-speed interconnect fabrics (InfiniBand, RoCE, or NVLink): configuration and troubleshooting.
Experience with large-scale infrastructure automation: host lifecycle management, fleet auto-healing, or AIOps-driven operations; building tooling that reduces manual intervention, not just running it.

Estimated Pay Range: $80-$100/hr

Hourly Wage Estimation for Site Reliability Engineer (AI Acceleration)- Hybrid in Santa Clara, CA
$65.00 to $86.00
