
Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

Amazon
Cupertino, CA Full Time
POSTED ON 4/13/2026
AVAILABLE BEFORE 6/13/2026

Description

Annapurna Labs designs silicon and software that accelerates innovation. Our custom chips, accelerators, and software stacks enable us to tackle unprecedented technical challenges and deliver solutions that help customers change the world. AWS Neuron is the complete software stack powering AWS Trainium (Trn2/Trn3), our cloud-scale machine learning accelerators. We are seeking a Senior Software Engineer to join our ML Distributed Training team.

In this role, you will be responsible for the development, enablement, and performance optimization of large-scale ML model training across diverse model families. This includes massive-scale pre-training and post-training of LLMs with dense and Mixture-of-Experts architectures, transformer- and diffusion-based multimodal models, and reinforcement learning workloads. You will work at the intersection of ML research and high-performance systems, collaborating closely with chip architects, compiler engineers, runtime engineers, and AWS solution architects to deliver cost-effective, performant machine learning solutions on AWS Trainium-based systems.

Key job responsibilities
You will design, implement, and optimize distributed training solutions for large-scale ML models running on Trainium instances. A significant part of your work will involve extending and optimizing popular distributed training frameworks, including FSDP (Fully Sharded Data Parallel), torchtitan, and Hugging Face libraries, for the Neuron ecosystem.
A core focus of this role involves developing and optimizing mixed-precision and low-precision training techniques. You will work with BF16, FP8, and emerging numerical formats to maximize training throughput while maintaining model accuracy and convergence quality. This requires implementing precision-aware training strategies, loss scaling techniques, and careful gradient management to ensure training stability across reduced-precision formats. Understanding the tradeoffs between computational efficiency and numerical fidelity is essential to success in this position.
Beyond precision optimization, you will profile, analyze, and tune end-to-end training pipelines to achieve optimal performance on Trainium hardware. You will partner with hardware, compiler, and runtime teams to influence system design and unlock new capabilities. Additionally, you will work directly with AWS solution architects and customers to deploy and optimize training workloads at scale.


About the team
Annapurna Labs was a startup acquired by AWS in 2015 and is now fully integrated. If AWS is an infrastructure company, then think of Annapurna Labs as the infrastructure provider of AWS. Our org covers multiple disciplines, including silicon engineering, hardware design and verification, software, and operations. Products we have delivered over the last few years include AWS Nitro, ENA, EFA, Graviton and F1 EC2 instances, AWS Neuron, the Inferentia and Trainium ML accelerators, and scalable NVMe storage.

Basic Qualifications

- Bachelor's degree in computer science or equivalent
- 5 years of non-internship professional software development experience
- 5 years of experience programming in at least one programming language
- 5 years of experience leading the design or architecture (design patterns, reliability, and scaling) of new and existing systems
- 5 years of experience across the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
- Experience as a mentor, tech lead or leading an engineering team
- Experience in machine learning and large-scale training of LLMs, and expertise in PyTorch

Preferred Qualifications

- Master's degree in computer science or equivalent
- Experience in computer architecture
- Previous software engineering expertise with PyTorch/JAX/TensorFlow, distributed libraries and frameworks, and end-to-end model training

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Los Angeles County applicants: Job duties for this position include: work safely and cooperatively with other employees, supervisors, and staff; adhere to standards of excellence despite stressful conditions; communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service; and follow all federal, state, and local laws and Company policies. Criminal history may have a direct, adverse, and negative relationship with some of the material job duties of this position. These include the duties and responsibilities listed above, as well as the abilities to adhere to company policies, exercise sound judgment, effectively manage stress and work safely and respectfully with others, exhibit trustworthiness and professionalism, and safeguard business operations and the Company’s reputation. Pursuant to the Los Angeles County Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.



USA, CA, Cupertino - 193,300.00 - 261,500.00 USD annually

Salary.com Estimation for Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training in Cupertino, CA
$153,767 to $185,256



