
Mid-Level Software Engineer, AI Reliability Engineering

Jobright.ai
San Francisco, CA Full Time
POSTED ON 11/3/2025
AVAILABLE BEFORE 12/2/2025

Jobright is an AI-powered career platform that helps job seekers discover the top opportunities in the US. We are NOT a staffing agency. Jobright does not hire directly for these positions. We connect you with verified openings from employers you can trust.


Job Summary:

Anthropic is a public benefit corporation focused on creating reliable and beneficial AI systems. They are seeking a Software Engineer in AI Reliability Engineering to develop reliability metrics and improve the reliability of their AI services, while also leveraging modern AI capabilities to enhance operational processes.


Responsibilities:

• Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity.

• Design and implement monitoring systems including availability, latency and other salient metrics.

• Assist in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of millions of external customers and high-traffic internal workloads.

• Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.

• Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.

• Build and maintain cost optimization systems for large-scale AI infrastructure, focusing on accelerator (GPU/TPU/Trainium) utilization and efficiency.


Qualifications:


Required:

• Bachelor's degree in a related field or equivalent experience

• Extensive experience with distributed systems observability and monitoring at scale

• Understanding of the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines

• Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services

• Comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)

• Experience with chaos engineering and systematic resilience testing

• Ability to effectively bridge the gap between ML engineers and infrastructure teams

• Excellent communication skills


Preferred:

• Experience operating large-scale model training infrastructure or serving infrastructure (>1000 GPUs)

• Experience with one or more ML hardware accelerators (e.g., GPUs, TPUs, Trainium)

• Understanding of ML-specific networking optimizations like RDMA and InfiniBand

• Expertise in AI-specific observability tools and frameworks

• Understanding of ML model deployment strategies and their reliability implications

• Contributions to open-source infrastructure or ML tooling


Company:

Anthropic is an AI research company that focuses on the safety and alignment of AI systems with human values. Founded in 2021, the company is headquartered in San Francisco, California, USA, and has a team of 501-1000 employees. It is currently a late-stage company. Anthropic has a track record of offering H1B sponsorships.

Salary.com Estimation for Mid-Level Software Engineer, AI Reliability Engineering in San Francisco, CA
$134,227 to $161,312



