
Senior Site Reliability Engineer GPU Infrastructure

ExecutivePlacements.com
San Francisco, CA Full Time
POSTED ON 11/3/2025
AVAILABLE BEFORE 12/2/2025
Join Genmo

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

What You'll Do

  • Own the design and day-to-day operation of GPU clusters that train and serve frontier generative models.
  • Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi-cluster federation.
  • Define and implement Infrastructure-as-Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.
  • Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.
  • Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.
  • Optimize high-performance networking (InfiniBand/RDMA) and debug perf bottlenecks.
  • Run and continuously improve the 24/7 on-call rotation; lead post-incident reviews.
  • Partner with researchers and engineers, communicate crisply, and ship with a high-ownership mindset.

Minimum Qualifications

  • BS/MS/PhD in CS, EE, or related field.
  • 3 yrs SRE/DevOps in production; 2 yrs managing large Kubernetes fleets.
  • Expert-level Kubernetes experience.
  • Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible).
  • Track record of shipping and operating large-scale infrastructure with high reliability and clear communication.

Nice to Have

  • Multi-cluster / multi-cloud (AWS, GCP, Azure, bare-metal) production experience.
  • Hands-on with containerized GPU stacks (nvidia-container-toolkit, GPU Operator).
  • GPU schedulers such as Slurm or Kueue.
  • Familiarity with CI/CD tooling (GitHub Actions, BuildKit).
  • Prior work with distributed training, model-serving patterns, or other ML/GPU workloads.

Machine-learning depth is a plus, not a prerequisite. We'll help you level up if needed.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.

Salary.com Estimation for Senior Site Reliability Engineer GPU Infrastructure in San Francisco, CA
$130,877 to $153,406


