ML Cluster Operations Engineer

TensorWave
Las Vegas, NV | Full Time
POSTED ON 11/18/2025 CLOSED ON 12/17/2025

What are the responsibilities and job description for the ML Cluster Operations Engineer position at TensorWave?

ML Cluster Operations Engineer (Slurm / K8s)

At TensorWave, we’re leading the charge in AI compute, building a versatile cloud platform that’s driving the next generation of AI innovation. We’re focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what’s possible in the AI landscape.

About the Role:

We are seeking an exceptional Machine Learning Engineer who has made training and AI workload scheduling a specialty. This is a senior-level role for someone who has significant experience managing distributed machine learning workloads at scale using Slurm and/or Kubernetes.

As a technical visionary and hands-on expert, you will lead the evolution of our managed Slurm and Kubernetes offerings, as well as internal health checking and cluster automation.

Key Responsibilities:

  • Manage and iterate on our containerized Slurm (Slurm-in-Kubernetes) solution, including customer configuration and deployment.
  • Work closely with our engineering team to develop and maintain CI and automation for managed offerings.
  • Ensure healthy cluster operations and uptime by implementing active and passive health checks, including automated node draining and triage.
  • Help profile and debug distributed workloads, from small inference jobs to cluster-wide training.
  • Establish best practices for running jobs at scale, such as monitoring and checkpointing.
  • Mentor and upskill ML engineers in best practices.

Qualifications:

Must-Have:

  • 5 years of experience in cloud infrastructure, HPC, or machine learning roles.
  • Significant hands-on experience with Slurm in production HPC/ML environments, including understanding of setup/configuration, enroot (pyxis), modules, and MPI.
  • Strong knowledge of distributed ML languages and frameworks, such as Python, PyTorch, Megatron, c10d, and MPI.
  • Understanding of node lifecycle, including health checks, prolog / epilog scripts, and draining.
  • Deep understanding of security, compliance, and resilience in containerized workloads.

Nice-to-Have:

  • 3 years of hands-on Kubernetes experience, including deep knowledge of the Kubernetes API, internals, networking, and storage.
  • Proficiency in writing Kubernetes manifests, Helm charts, and managing releases.
  • Experience with DAGs using K8s native tools such as Argo Workflows.
  • Foundation in networking, especially as it pertains to RDMA, RoCE, and InfiniBand.
  • Experience with low-level kernel libraries, such as CUDA and Composable Kernel.
  • Contributions to open-source projects or ML/AI tooling.

What Success Looks Like:

  • A production-grade integrated Slurm platform that can support thousands of GPUs, with self-healing, scaling, and strong observability.
  • Infrastructure is resilient, secure, resource-optimized, and compliant.
  • Best practices and tooling are well-documented, standardized, and continuously improved across the company.
  • Make GPUs go Brrrrrrr

What We Bring:

Stock Options

100% paid Medical, Dental, and Vision insurance

Life and Voluntary Supplemental Insurance

Short Term Disability Insurance

Flexible Spending Account

401(k)

Flexible PTO

Paid Holidays

Parental Leave

Mental Health Benefits through Spring Health

Salary.com Estimation for ML Cluster Operations Engineer in Las Vegas, NV
$74,441 to $89,905

This job has expired.

