What are the responsibilities and job description for the Senior Software Engineer position at Acceler8 Talent?
Software Engineer: ML Infrastructure
Robotics x AI Startup | Foundation Models for Robotics | On-site (SF Bay Area or Boston)
We’re hiring a Software Engineer: ML Infrastructure to join a frontier AI and robotics company building large-scale foundation models for real-world robotic systems.
This team is training next-generation embodied AI models — requiring massive GPU clusters, highly distributed systems, and ultra-efficient data pipelines to operate at scale.
You’ll own the infrastructure powering both large-scale model training and real-time robot inference — working across cloud and on-device systems with strict latency and performance constraints.
You’ll work on:
- GPU infrastructure powering large-scale distributed training
- High-performance data pipelines handling massive datasets
- Systems that maximize utilization across GPU fleets
- Real-time inference infrastructure running directly on robots
- Orchestration of both cloud-based and on-device ML workloads
What You’ll Do:
- Own and manage large-scale GPU compute fleets (cloud and on-prem)
- Ensure infrastructure is highly usable, reliable, and optimized for researchers
- Design and optimize ML data loading, transport, and storage systems
- Orchestrate distributed training and inference workloads
- Improve efficiency across compute, networking, and storage layers
What We’re Looking For:
- Experience managing large GPU clusters for distributed training or inference
- Strong background in ML infrastructure and systems at scale
- Hands-on experience with Slurm, Kubernetes, or similar orchestration tools
- Experience building high-performance data pipelines for ML workloads
- Deep understanding of hardware, storage, and networking layers
- Familiarity with the NVIDIA GPU ecosystem
On-site — San Mateo (SF Bay Area) or Somerville (Boston)
Salary and compensation are open to discussion