Machine Learning Engineer - Distributed ML Systems

Pluralis Research
California, CA Full Time
POSTED ON 5/27/2026
AVAILABLE BEFORE 6/26/2026
Overview

Pluralis Research carries out foundational research on Protocol Learning: multi-participant training of foundation models where no single participant has, or can ever obtain, a full copy of the model. The purpose of Protocol Learning is to facilitate the creation of community-trained and community-owned frontier models with self-sustaining economics.

We're looking for Senior/Staff engineers with 5 years of experience in distributed systems and large-scale ML training. You'll implement a novel substrate for training distributed ML models over consumer-grade internet connections.

Responsibilities

Distributed Training Architecture & Optimization

  • Design and implement large-scale distributed training systems optimized for heterogeneous hardware operating under low-bandwidth, high-latency conditions.
  • Develop and optimize parallel training strategies (data, tensor, and pipeline parallelism) with custom sharding techniques that minimize communication overhead.
  • Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
  • Implement robust checkpointing, state synchronization, and recovery mechanisms for long-running, fault-prone training jobs.
  • Build monitoring and metrics systems to track training progress, model quality, and system bottlenecks.

Decentralized Networking & Resilience

  • Architect resilient training systems where nodes can fail, networks can partition, and participants can dynamically join or leave.
  • Design and optimize peer-to-peer topologies for decentralized coordination across non-co-located nodes.
  • Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management.
  • Profile and optimize communication patterns to reduce latency and bandwidth overhead in multi-participant environments.

What You’ll Bring

  • Strong experience building and operating distributed systems in production.
  • Hands-on expertise with distributed training frameworks (FSDP, DeepSpeed, Megatron, or similar).
  • Deep understanding of parallelism strategies (data, tensor, and pipeline parallelism).
  • Expert-level Python with production experience (concurrency, error handling, retry logic, clean architecture).
  • Strong networking fundamentals: P2P systems, gRPC, routing, NAT traversal, distributed coordination.
  • Experience optimizing GPU workloads, memory management, and large-scale compute efficiency.

What We Offer

  • Equity-heavy compensation with meaningful ownership in a mission-driven company
  • Competitive base salary for senior engineering roles in Australia
  • Visa sponsorship available for exceptional candidates
  • Remote-first with optional access to our Melbourne hub
  • World-class team: teammates previously worked at Google, Amazon, Microsoft, and leading startups

Backed by Union Square Ventures and other tier-1 investors, we're a world-class, deeply technical team of ML researchers and engineers. Pluralis is unapologetically ideological. We believe the world will be a better place if we succeed in what we are attempting, and we see Protocol Learning as the only plausible way to prevent a handful of massive corporations from monopolising model development, access, and release, and from achieving massive economic capture. If this resonates, please apply.

Salary.com Estimation for Machine Learning Engineer - Distributed ML Systems in California, CA
$127,525 to $149,878