What are the responsibilities and job description for the Mid-Level Software Engineer, AI Reliability Engineering position at Jobright.ai?
Jobright is an AI-powered career platform that helps job seekers discover the top opportunities in the US. We are NOT a staffing agency. Jobright does not hire directly for these positions. We connect you with verified openings from employers you can trust.
Job Summary:
Anthropic is a public benefit corporation focused on creating reliable and beneficial AI systems. They are seeking a Software Engineer in AI Reliability Engineering to develop reliability metrics and improve the reliability of their AI services, while also leveraging modern AI capabilities to enhance operational processes.
Responsibilities:
• Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity.
• Design and implement monitoring systems that track availability, latency, and other salient metrics.
• Assist in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of millions of external customers and high-traffic internal workloads.
• Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
• Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.
• Build and maintain cost optimization systems for large-scale AI infrastructure, focusing on accelerator (GPU/TPU/Trainium) utilization and efficiency.
Qualifications:
Required:
• Bachelor's degree in a related field or equivalent experience
• Extensive experience with distributed systems observability and monitoring at scale
• Understanding of the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
• Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
• Comfort working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
• Experience with chaos engineering and systematic resilience testing
• Ability to effectively bridge the gap between ML engineers and infrastructure teams
• Excellent communication skills
Preferred:
• Experience operating large-scale model training infrastructure or serving infrastructure (>1000 GPUs)
• Experience with one or more ML hardware accelerators (e.g., GPUs, TPUs, Trainium)
• Understanding of ML-specific networking optimizations like RDMA and InfiniBand
• Expertise in AI-specific observability tools and frameworks
• Understanding of ML model deployment strategies and their reliability implications
• Contributions to open-source infrastructure or ML tooling
Company:
Anthropic is an AI research company that focuses on the safety and alignment of AI systems with human values. Founded in 2021, the company is headquartered in San Francisco, California, USA, with a team of 501-1000 employees. The company is currently classified as late stage. Anthropic has a track record of offering H1B sponsorships.