
DevOps / Site Reliability Engineer

Reactor
San Francisco, CA | Full Time
POSTED ON 5/11/2026
AVAILABLE BEFORE 11/4/2026

Department: Engineering

Location: San Francisco

Description

We're looking for a DevOps / SRE engineer to own the reliability, delivery, and observability of our AI platform. You'll be the person who ensures models get from a developer's branch to production without anyone losing sleep — and when something does go wrong at 2am, you'll be the one who knows where to look.

We run production across multiple Kubernetes clusters, cloud providers, and regions. Our deployment pipeline is fully automated through CI/CD and GitOps, our infrastructure is managed as code, and our observability stack gives us full visibility across every service and GPU workload. This role is about making all of that faster, more reliable, and easier to operate as we scale.

What You'll Do

  • Own and evolve our CI/CD pipelines: dynamic pipeline generation across a monorepo of Go services, Python model containers, and Helm charts
  • Operate and improve our GitOps deployment lifecycle: Helm releases, Kustomizations, and image automation across multiple clusters
  • Build and maintain our observability stack: distributed tracing, metrics, dashboards, and alerting across all services and GPU workloads
  • Define and track SLOs for core platform services, including session latency, model cold start time, and streaming reliability
  • Run incident response: triage production issues, write postmortems, build runbooks, and drive reliability improvements
  • Manage infrastructure-as-code across multiple cloud providers and regions: plan/apply workflows, state management, drift detection
  • Operate secret management: encrypted secrets, external secret syncing, certificate automation
  • Improve deployment safety: canary rollouts, health checks, startup probes, rollback automation
  • Manage authentication infrastructure: OIDC federation for CI, workload identity for cloud services, cross-cloud credential management
  • Participate in on-call rotation and build the tooling that makes on-call less painful

What We're Looking For

  • You've run production Kubernetes clusters and been on-call for them. You've debugged node scheduling failures, OOM kills, and mysterious pod evictions at 3am
  • Strong CI/CD experience: you've built and maintained pipelines for monorepos, not just single-service repos
  • GitOps experience: you understand reconciliation loops, drift detection, and why image automation matters
  • Infrastructure-as-code fluency with Terraform or similar across multiple environments and cloud accounts
  • You know observability beyond just "set up dashboards". You've defined SLOs, built alerting that doesn't page on noise, and used traces to debug cross-service latency issues
  • Comfortable with secret management patterns (KMS, encrypted configs, external secret operators). You've thought about credential rotation and zero-trust
  • Incident response experience: you've triaged production outages, written postmortems that actually led to improvements, and built runbooks that other engineers could follow
  • You write code, not just YAML. Proficiency in Go, Python, or Bash for building tooling, automation, and pipeline scripts

Nice to Have

  • Experience with GPU workloads on Kubernetes: device plugins, GPU-aware scheduling, GPU monitoring
  • Multi-cloud operations beyond a single provider
  • Real-time or streaming workloads: low-latency systems where p99 matters more than average
  • Experience with Helm chart authoring and managing complex value layering across environments
  • Familiarity with real-time media or relay infrastructure
  • FinOps experience: GPU cost optimization, spot/preemptible instance management

What We're Not Looking For

  • Engineers who treat infrastructure-as-code as "click around in the console and import later"
  • SREs who've only monitored systems but never built the deployment pipelines that ship to them
  • Candidates whose CI/CD experience is limited to GitHub Actions for a single-service repo
  • People who write alerts that fire every day and then get ignored

Logistics

This role is in-person in San Francisco. We are also hiring for it in Europe to spread on-call coverage across time zones.

Benefits

  • Competitive salary and meaningful early equity
  • Visa sponsorship and relocation support
  • Generous health, dental, and vision coverage
