Lead HPC Infrastructure Engineer

ExecutivePlacements.com
San Francisco, CA Full Time
POSTED ON 12/26/2025
AVAILABLE BEFORE 1/25/2026
We are seeking a highly accomplished engineer to take ownership of the operations and optimization of next-generation NVIDIA GB200 and GB300 GPU clusters. This role sits at the intersection of high-performance computing and AI infrastructure, where precision, automation, and scale meet innovation.

You will shape and maintain the reliability of some of the most advanced computer systems ever built, leveraging Linux, Kubernetes, Terraform, Ansible, and Helm to enable seamless, intelligent operations.

This is a rare opportunity to work on cutting-edge GPU infrastructure, solving complex challenges that push the boundaries of performance and efficiency.

Office Travel: Frequent on-site work is required for this position (2-3 days/week) at our Santa Clara, CA office.

Job Responsibilities

You will take ownership of mission-critical NVIDIA GB200 and GB300 clusters, ensuring their reliability, performance, and continuous operation.

You will act as the first responder and escalation point for operational issues, leading response efforts with calm and technical precision.

You will design, develop, and maintain Infrastructure as Code (IaC) solutions that enable automation, diagnostics, and deployment across Slurm and Kubernetes environments.

You will proactively analyze system logs, metrics, and telemetry to identify subtle anomalies, anticipate failures, and prevent service degradation.

You will perform deep, system-wide diagnostics on Grace Blackwell Superchips and NVLink fabric, driving root cause analysis and continuous improvement.

You will document operational knowledge by creating troubleshooting guides, procedures, and runbooks for complex or novel incidents.

You will lead and coordinate incident management efforts, collaborating with engineering teams and external partners to restore system stability.

You will mentor early-career engineers, promoting a culture of learning, ownership, and operational excellence.

You will communicate asynchronously and effectively, providing clear, detailed, and actionable updates to global teams.

You will maintain accountability and focus in a 12x7 on-call rotation, ensuring fast, accurate support for cluster operations.

Job Qualifications

Technical Skills

You bring deep expertise in Linux systems engineering, including kernel-level troubleshooting and performance analysis.

You bring hands-on experience with HPC workload schedulers such as Slurm and Kubernetes (K8s) for orchestration and resource allocation.

You build automation and Infrastructure as Code with Terraform, Ansible, and Helm, ensuring consistency across large-scale environments.

You have advanced scripting proficiency in Python and Bash for automation, data parsing, and diagnostic tooling.

You understand GPU compute architecture, NVLink, InfiniBand, and collective communication libraries (MPI, NCCL) at an expert operational level.

You have experience supporting frontline HPC operations in national laboratories, cloud providers, or large-scale technology organizations.

Professional Skills

You demonstrate strong ownership and accountability in high-stakes, time-sensitive environments.

You collaborate effectively across engineering, operations, and partner teams to solve critical challenges.

You apply structured problem-solving to diagnose and resolve undocumented or complex failures.

You communicate clearly and concisely, making deep technical material accessible to both technical and non-technical audiences.

You work autonomously and asynchronously, managing ambiguity with focus and precision.

You mentor and uplift others, fostering continuous learning and a shared culture of operational excellence.

Other Things To Know

There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you. But we also balance autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs and teammates who want to help you grow. We see value in helping each other be our best and that extends to empowering our employees in their career journeys.

About Thoughtworks

Thoughtworks is a dynamic and inclusive community of bright and supportive colleagues who are revolutionizing tech. As a leading technology consultancy, we're pushing boundaries through our purposeful and impactful work. For 30 years, we've delivered extraordinary impact together with our clients by helping them solve complex business problems with technology as the differentiator. Bring your brilliant expertise and commitment to continuous learning to Thoughtworks. Together, let's be extraordinary.

Salary

The annual salary range posted is subject to many factors and may vary depending on experience, geographic location, job responsibilities, performance, skills and/or training.

$169,000 - $270,000 USD
