
AI Infrastructure — Training Engineer (Large Model) [33251]

Stealth Startup
Menlo Park, CA | Full Time
POSTED ON 6/25/2026
AVAILABLE BEFORE 7/23/2026

Responsibilities

  • Distributed training framework optimization. Own the R&D and tuning of distributed training frameworks for large models (LLMs, multimodal), resolving scalability bottlenecks on clusters of 10,000 to 100,000 GPUs.
  • Kernel & performance tuning. Work close to the underlying hardware (NVIDIA GPUs and NPUs) on kernel acceleration, memory optimization, and communication optimization, using techniques such as tensor parallelism, pipeline parallelism, and ZeRO.
  • System resilience & scheduling. Build stable large-scale training clusters; design high-availability fault-tolerance mechanisms (checkpoint/resume, automatic recovery) and compute scheduling strategies to raise overall cluster throughput and resource utilization.
  • Training pipeline engineering. Build an end-to-end MLOps platform spanning data preprocessing, distributed training, model fine-tuning (RLHF / DPO, etc.), and automated evaluation.

Qualifications

  • Education. Bachelor's degree or above in Computer Science, Software Engineering, Electrical Engineering, or a related field.
  • Programming. Strong engineering and implementation skills; proficient in C/C++ and Python, with a solid foundation in data structures and algorithms.
  • Distributed & parallel computing. Hands-on mastery of mainstream distributed training frameworks such as PyTorch, Megatron-LM, DeepSpeed, DeepSpeed-Chat, or Horovod.
  • Low-level systems & communication. Familiar with Linux internals, the network stack (RoCE/RDMA), GPU communication primitives (e.g., NCCL), and common storage systems.
  • Tuning & debugging. Skilled with profiling and debugging tools such as Nsight, GDB, and PyTorch Profiler; able to quickly diagnose cluster deadlocks, performance bottlenecks, and out-of-memory (OOM) issues.


Salary.com Estimation for AI Infrastructure — Training Engineer (Large Model) [33251] in Menlo Park, CA
$94,619 to $107,738




