Senior LLM Training Resilience Engineer

Jobright.ai
San Francisco, CA Full Time
POSTED ON 9/25/2025
AVAILABLE BEFORE 10/22/2025

Jobright is an AI-powered career platform that helps job seekers discover the top opportunities in the US. We are NOT a staffing agency. Jobright does not hire directly for these positions. We connect you with verified openings from employers you can trust.


Job Summary:

Together AI is a research-driven artificial intelligence company focused on developing open-source generative AI and infrastructure for AI models. The company is seeking a Large-scale Training Resilience Engineer to ensure the reliability and scalability of its training infrastructure, solve complex distributed systems problems, and build highly available AI training pipelines.


Responsibilities:

• Develop systems to identify, isolate, and recover from failures in large-scale distributed training workloads.

• Implement proactive error-detection mechanisms, including straggler detection and fault prediction algorithms.

• Ensure stability and consistency across distributed training clusters (e.g., GPU/TPU clusters).

• Optimize recovery time and throughput in the face of hardware or software failures.

• Design and maintain observability systems for monitoring cluster health, training performance, and failure patterns.

• Leverage telemetry data to improve incident response and automate mitigation strategies.

• Build resilience-focused tooling, such as job health monitors, distributed checkpoint systems, and automated recovery workflows.

• Enhance debugging and diagnosis frameworks for distributed training jobs.

• Collaborate with platform engineers, researchers, and ML practitioners to identify pain points and resilience requirements.

• Document and communicate best practices for fault-tolerant AI training.


Qualifications:


Required:

• 5 years of experience in distributed systems, cloud infrastructure, or large-scale machine learning training.

• Proficiency in distributed computing frameworks (e.g., PyTorch DDP, TensorFlow, Horovod).

• Strong knowledge of resilience strategies in distributed systems (e.g., leader election, consensus, retry mechanisms).

• Hands-on experience with observability tools (e.g., Prometheus, Grafana, ELK stack).

• Proficient in Python, Go, or a similar programming language.

• Experience working with cloud platforms (e.g., AWS, GCP, Azure) and Kubernetes for workload orchestration.

• Strong analytical, problem-solving, and debugging skills.

• Excellent collaboration and communication skills.


Preferred:

• Familiarity with GPU/TPU cluster management and scheduling.

• Experience with high-availability database systems or message queues.

• Experience with open-source contributions or community engagement.


Company:

Together AI is a cloud-based platform for building open-source generative AI and the infrastructure for developing AI models. Founded in 2022, the company is headquartered in San Francisco, California, USA, and has a team of 201-500 employees. The company is currently in the growth stage. Together AI has a track record of offering H1B sponsorships.
