What are the responsibilities and job description for the Senior LLM Training Resilience Engineer position at Jobright.ai?

Jobright is an AI-powered career platform that helps job seekers discover the top opportunities in the US. We are NOT a staffing agency. Jobright does not hire directly for these positions. We connect you with verified openings from employers you can trust.

Job Summary:

Together AI is a research-driven artificial intelligence company focused on developing open-source generative AI and infrastructure for AI models. The company is seeking a Large-Scale Training Resilience Engineer to ensure the reliability and scalability of its large-scale training infrastructure, address complex distributed systems problems, and build highly available AI training pipelines.

Responsibilities:

• Develop systems to identify, isolate, and recover from failures in large-scale distributed training workloads.

• Implement proactive error-detection mechanisms, including straggler detection and fault prediction algorithms.

• Ensure stability and consistency across distributed training clusters (e.g., GPU/TPU clusters).

• Optimize recovery time and throughput in the face of hardware or software failures.

• Design and maintain observability systems for monitoring cluster health, training performance, and failure patterns.

• Leverage telemetry data to improve incident response and automate mitigation strategies.

• Build resilience-focused tooling, such as job health monitors, distributed checkpoint systems, and automated recovery workflows.

• Enhance debugging and diagnosis frameworks for distributed training jobs.

• Collaborate with platform engineers, researchers, and ML practitioners to identify pain points and resilience requirements.

• Document and communicate best practices for fault-tolerant AI training.

Qualifications:

Required:

• 5 years of experience in distributed systems, cloud infrastructure, or large-scale machine learning training.

• Proficiency in distributed computing frameworks (e.g., PyTorch DDP, TensorFlow, Horovod).

• Strong knowledge of resilience strategies in distributed systems (e.g., leader election, consensus, retry mechanisms).

• Hands-on experience with observability tools (e.g., Prometheus, Grafana, ELK stack).

• Proficient in Python, Go, or a similar programming language.

• Experience working with cloud platforms (e.g., AWS, GCP, Azure) and Kubernetes for workload orchestration.

• Strong analytical, problem-solving, and debugging skills.

• Excellent collaboration and communication skills.

Preferred:

• Familiarity with GPU/TPU cluster management and scheduling.

• Experience with high-availability database systems or message queues.

• Experience with open-source contributions or community engagement.

Company:

Together AI is a cloud-based platform for building open-source generative AI and the infrastructure for developing AI models. Founded in 2022, the company is headquartered in San Francisco, California, USA, and has a team of 201-500 employees. The company is currently at the growth stage. Together AI has a track record of offering H1B sponsorships.


