
Site Reliability Engineer (Sonoma)

ExecutivePlacements.com
Sonoma, CA Full Time
POSTED ON 12/24/2025
AVAILABLE BEFORE 1/23/2026
Senior Platform Engineer/Site Reliability Engineer, AI Infrastructure

Join a stealth-mode startup building out its AI and cloud platform, powered by thousands of H100s, H200s, and B200s that are ready for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you'll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access. You'll also support the company's exciting new products as they come to market.

This is a rare opportunity to work at the intersection of infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment.

If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Required Skills & Experience

  • Customer-facing experience and the attitude to be a Swiss Army knife!
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Salary.com Estimation for Site Reliability Engineer (Sonoma) in Sonoma, CA
$104,429 to $130,246



