What are the responsibilities and job description for the Software Engineer II - Michelangelo position at Uber?
About The Role
Uber's AI platform, Michelangelo, provides end-to-end infrastructure that enables ML engineers and data scientists to build, deploy, and scale machine learning solutions across all critical stages: feature engineering, distributed training, model inference, MLOps, resource management, and monitoring.
The ML Training team is a core pillar of Michelangelo, focused on building large-scale distributed training systems for multi-GPU/TPU environments. The team works across distributed systems, training frameworks, and ML infrastructure to enable efficient, reliable, and scalable model development.
As a Software Engineer II, you will design and build core components of Uber's ML training platform. You will work on challenging distributed systems problems, contribute to system design, and collaborate closely with ML engineers and data scientists to deliver scalable AI solutions.
This role is ideal for engineers who are strong in systems and implementation, and want to work on impactful ML infrastructure at scale.
---- What the Candidate Will Do ----
- Design, build, and maintain components of distributed training systems for multi-GPU/TPU environments.
- Implement features and improvements for ML training infrastructure and platform services.
- Collaborate with ML engineers and data scientists to support model development and deployment workflows.
- Write clean, efficient, and maintainable code with proper testing and documentation.
- Debug and resolve issues in distributed systems and ML pipelines with guidance from senior engineers.
- Improve system reliability and performance through incremental enhancements.
- Participate in code reviews and design discussions to learn and apply engineering best practices.
- Contribute to team projects and support overall platform development efforts.
---- What the Candidate Will Need ----
- BS/MS in Computer Science or related field and 2 years of Software Engineering experience (or PhD new graduates).
- Strong programming skills in Python, Java, Go, or C++.
- Experience building software systems or services (e.g., backend systems, data pipelines, or infrastructure).
- Familiarity with distributed systems fundamentals.
- Exposure to ML/DL frameworks (e.g., PyTorch, TensorFlow, JAX) or ML workflows.
---- Bonus Points ----
- Experience with distributed systems or cloud-based infrastructure.
- Familiarity with ML infrastructure or training workflows.
- Exposure to distributed training technologies (e.g., DDP, FSDP, DeepSpeed).
- Basic understanding of GPU/TPU environments and accelerator hardware.
- Experience with data processing systems (e.g., Spark, Ray).
- Interest in performance optimization and system efficiency.
- Strong debugging and problem-solving skills.
Salary: $171,000 - $190,000