Software Engineer - Distributed Training

ExecutivePlacements.com
Palo Alto, CA Full Time
POSTED ON 11/5/2025
AVAILABLE BEFORE 12/4/2025
Join to apply for the Software Engineer - Distributed Training role at Clockwork Systems, Inc.

Clockwork.io is a Silicon Valley startup that delivers state-of-the-art AI compute acceleration.

We were founded by Stanford researchers and veteran systems engineers with a shared belief: the distributed systems powering modern AI require a new approach to managing time, reliability, and performance. Unlike traditional solutions that rely on specialized hardware or embedded telemetry in switches, Clockwork's system brings deep visibility, resilience, acceleration, and efficiency to the network layer entirely through software. As AI workloads continue to scale in size, urgency, and impact, networks must evolve to keep up. Clockwork exists to make that evolution possible.

About Us

Clockwork.io: A Software-Driven Revolution in AI Networking

Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI networking, delivering deterministic time, ultra-low latency, and seamless scalability for modern distributed systems.

To learn more, visit www.clockwork.io.

About The Role

We are looking for an experienced software engineer to help build, optimize, and maintain large-scale distributed training infrastructure based on the PyTorch ecosystem. This role focuses on production-grade training workflows involving multi-GPU and multi-node orchestration, high-performance communication layers, and advanced parallelism strategies.

You'll work alongside infrastructure and machine learning teams to ensure training jobs are efficient, scalable, and resilient.

What You Will Do

  • Develop and support distributed PyTorch training jobs using torch.distributed / c10d
  • Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks
  • Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)
  • Optimize performance across communication, I/O, and memory bottlenecks
  • Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs
  • Write tooling and scripts to streamline training workflows and experiment management
  • Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)

What We're Looking For

  • Deep experience with PyTorch and torch.distributed (c10d)
  • Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale
  • Proficiency in Python and Linux shell scripting
  • Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar
  • Strong understanding of NCCL, collective communication, and GPU topology
  • Familiarity with debugging tools and techniques for distributed systems

Preferred Skills

  • Experience scaling LLM training across 8 GPUs and multiple nodes
  • Knowledge of tensor, pipeline, and data parallelism
  • Familiarity with containerized training environments (Docker, Singularity)
  • Exposure to HPC environments or cloud GPU infrastructure
  • Experience with training workload orchestration tools or custom job launchers
  • Comfort with large-scale checkpointing, resume/restart logic, and model I/O

Bonus Skills

  • Profiling tools: PyTorch Profiler, Nsight, nvprof, or equivalent
  • Experience with performance tuning in distributed training environments
  • Contributions to ML infrastructure open-source projects
  • Familiarity with storage, networking, or RDMA/GPUDirect technologies
  • Understanding of observability in ML pipelines (metrics, logs, dashboards)

Enjoy

  • Challenging projects.
  • A friendly and inclusive workplace culture.
  • Competitive compensation.
  • A great benefits package.
  • Catered lunch.

Clockwork is assembling world-class teams to build cutting-edge software. We look for bright people from all walks of life, and we grow together. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity, national origin, or protected veteran status and will not be discriminated against on the basis of disability.

Seniority level

  • Mid-Senior level

Employment type

  • Full-time

Job function

  • Engineering and Information Technology

Industries

  • Software Development

Salary: $160,000 - $180,000
