
HPC Engineer - Research Infrastructure

Luma AI
Palo Alto, CA Full Time
POSTED ON 10/7/2025
AVAILABLE BEFORE 11/14/2025
About Luma AI

Luma's mission is to build multimodal AI by pushing the boundaries of what is possible with large-scale supercomputing. We are building some of the biggest and fastest AI clusters in the world, and this role is at the very heart of that effort. This requires a deep, first-principles understanding of how hardware and software intersect to unlock maximum performance.

Where You Come In

This is a rare, foundational role for a hybrid SRE/HPC engineer with elite, low-level expertise in GPUs, high-performance networking, and Linux. You will be responsible for the performance and stability of our massive GPU supercomputing infrastructure. This role demands the ability to design, debug, and optimize at every level of the stack. You will manage our training clusters from provisioning to performance tuning, ensuring our researchers have the most powerful and efficient platform possible.

What You'll Do

  • Architect & Optimize Supercomputers: Design, build, and tune systems that combine CPUs, GPUs (NVIDIA and AMD), and high-performance networking into world-class clusters.
  • Master Low-Level Performance: Dive deep into the Linux OS, device drivers, and user-space code to optimize performance at every level of the stack.
  • Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA.
  • Manage HPC Schedulers: Architect and manage modern HPC job management frameworks like Kubernetes, designing queue and partition setups to maximize throughput and utilization for mixed research workloads.
  • Build Automation for Scale: Write code to automate the monitoring, diagnostics, and healing of thousands of servers, enabling a massive infrastructure footprint with a small, elite team.

Who You Are

  • You have 8 years of experience as an Infrastructure, DevOps, or HPC engineer working on large, complex distributed systems.
  • You have deep, hands-on experience managing and troubleshooting large GPU clusters, from provisioning to monitoring.
  • You are an expert in high-performance networking, with practical experience in InfiniBand, RDMA, or RoCE.
  • You possess extensive knowledge of Linux systems, including performance tuning, debugging, and configuration.
  • You have a deep understanding of modern HPC job management systems based on Kubernetes, and are familiar with workflow orchestration frameworks like Ray or Flyte.
  • You have architected, built, and maintained large-scale Kubernetes clusters from first principles, including managing the control plane and node components in a production environment.
  • You are an independently driven, tenacious problem-solver who can own issues end to end.

What Sets You Apart (Bonus Points)

  • Experience at national labs, research universities, or companies known for their large-scale, on-prem supercomputing infrastructure, with a focus on containerized applications.
  • Deep expertise with GPU tooling for NVIDIA and AMD GPUs, like DCGM or ROCm.



