What are the responsibilities and job description for the Senior LLM Training Resilience Engineer position at Jobright.ai?
Jobright is an AI-powered career platform that helps job seekers discover the top opportunities in the US. We are NOT a staffing agency. Jobright does not hire directly for these positions. We connect you with verified openings from employers you can trust.
Job Summary:
Together AI is a research-driven artificial intelligence company focused on developing open-source generative AI and infrastructure for AI models. The company is seeking a Large-scale Training Resilience Engineer to ensure the reliability and scalability of its large-scale training infrastructure, address complex distributed systems problems, and build highly available AI training pipelines.
Responsibilities:
• Develop systems to identify, isolate, and recover from failures in large-scale distributed training workloads.
• Implement proactive error-detection mechanisms, including straggler detection and fault prediction algorithms.
• Ensure stability and consistency across distributed training clusters (e.g., GPU/TPU clusters).
• Optimize recovery time and throughput in the face of hardware or software failures.
• Design and maintain observability systems for monitoring cluster health, training performance, and failure patterns.
• Leverage telemetry data to improve incident response and automate mitigation strategies.
• Build resilience-focused tooling, such as job health monitors, distributed checkpoint systems, and automated recovery workflows.
• Enhance debugging and diagnosis frameworks for distributed training jobs.
• Collaborate with platform engineers, researchers, and ML practitioners to identify pain points and resilience requirements.
• Document and communicate best practices for fault-tolerant AI training.
Qualifications:
Required:
• 5 years of experience in distributed systems, cloud infrastructure, or large-scale machine learning training.
• Proficiency in distributed computing frameworks (e.g., PyTorch DDP, TensorFlow, Horovod).
• Strong knowledge of resilience strategies in distributed systems (e.g., leader election, consensus, retry mechanisms).
• Hands-on experience with observability tools (e.g., Prometheus, Grafana, ELK stack).
• Proficiency in Python, Go, or a similar programming language.
• Experience working with cloud platforms (e.g., AWS, GCP, Azure) and Kubernetes for workload orchestration.
• Strong analytical, problem-solving, and debugging skills.
• Excellent collaboration and communication skills.
Preferred:
• Familiarity with GPU/TPU cluster management and scheduling.
• Experience with high-availability database systems or message queues.
• Experience with open-source contributions or community engagement.
Company:
Together AI is a cloud-based platform for building open-source generative AI and the infrastructure used to develop AI models. Founded in 2022, the company is headquartered in San Francisco, California, USA, and has 201-500 employees. It is currently a growth-stage company and has a track record of offering H1B sponsorships.