Site Reliability Engineer, AI/ML Infrastructure

Boson AI
Santa Clara, CA | Full Time
POSTED ON 11/14/2025
AVAILABLE BEFORE 1/14/2026
About The Role

We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers.

You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building, testing, deploying, and keeping everything running smoothly. That means troubleshooting issues as they arise, monitoring performance, developing automation to make our lives easier, and working closely with engineering and science teams to ensure they have what they need. You'll also help us plan for future capacity and evaluate new technologies as we continue to scale.

Responsibilities

  • Manage and optimize HPC cluster operations.
  • Deploy and maintain infrastructure-as-code solutions.
  • Support ML/research teams with cluster usage optimization.
  • Operate, troubleshoot, and optimize Ceph storage clusters.
  • Develop automation and tooling.

Minimum Qualifications

  • 5 years of experience in SRE or HPC operations.
  • Proficiency in Linux systems administration (Ubuntu/Debian).
  • Experience with Kubernetes and container orchestration.
  • Experience deploying and maintaining Ceph clusters larger than 1PB.
  • Knowledge of security best practices in multi-tenant environments.
  • Understanding of L2/L3 networking fundamentals.
  • Skilled in Python and Bash scripting.

Preferred Qualifications

  • Experience with infrastructure-as-code tools (Ansible/Terraform).
  • Experience with GitOps (Helm, ArgoCD).
  • Strong grasp of RDMA, InfiniBand, and GPUDirect technologies.
  • Familiarity with deep learning frameworks such as PyTorch and TensorFlow.
  • Familiarity with at least one cloud platform: AWS, Azure, or GCP.

If you're a natural problem-solver with a passion for continuous learning, we'd love to hear from you.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.

Salary.com Estimation for Site Reliability Engineer, AI/ML Infrastructure in Santa Clara, CA
$122,782 to $144,617

