Senior HPC & GPU Infrastructure Engineer

Sciforium
San Francisco, CA (Full Time)
POSTED ON 12/5/2025 CLOSED ON 12/24/2025

What are the responsibilities and job description for the Senior HPC & GPU Infrastructure Engineer position at Sciforium?

Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. Backed by multi-million-dollar funding and direct sponsorship from AMD, with hands-on support from AMD engineers, the team is scaling rapidly to build the full stack powering frontier AI models and real-time applications.

We offer a fast-moving, collaborative environment where engineers have meaningful impact, learn quickly, and tackle deep technical challenges across the AI systems stack.

Role Overview

We are seeking a Senior HPC & GPU Infrastructure Engineer to take full ownership of the health, reliability, and performance of our GPU compute cluster. You will be the primary custodian of our high-density accelerator environment and the linchpin between hardware operations, distributed systems, and machine learning workflows. This role spans everything from hands-on Linux systems engineering and GPU driver bring-up to maintaining the ML software stack (CUDA/ROCm, PyTorch, JAX, vLLM). If you love squeezing every bit of performance out of hardware, enjoy debugging GPUs at scale, and want to build world-class AI infrastructure, this role is for you.

Key Responsibilities

  • System Health & Reliability (SRE)
      • On-Call Response: Act as the primary responder for system outages, GPU failures, node crashes, and cluster-wide incidents. Minimize downtime by resolving issues rapidly.
      • Cluster Monitoring: Implement and maintain monitoring for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and overall system load.
      • Vendor Liaison: Coordinate with data center staff, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster.
  • Linux & Network Administration
      • OS Management: Install, patch, and maintain Linux distributions (Ubuntu / CentOS / RHEL). Ensure consistent configuration, kernel tuning, and automation for large node fleets.
      • Security & Access Controls: Configure VPNs, iptables/firewalls, SSH hardening, and network routing to secure our compute infrastructure.
      • Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity, and administer distributed file systems such as NFS, GPFS, or Lustre.
  • GPU & ML Stack Engineering
      • Deployment & Bring-Up: Lead deployment of new GPU nodes, including BIOS configuration, NUMA tuning, GPU topology validation, and cluster integration.
      • Driver & Kernel Management: Build and optimize kernel modules, and maintain GPU drivers and runtime stacks for both NVIDIA (CUDA) and AMD (ROCm).
      • Software Stack Maintenance: Maintain and optimize ML frameworks and libraries: PyTorch, JAX, CUDA toolkit, cuDNN, ROCm, NCCL, and supporting runtime systems.
      • Advanced Debugging: Troubleshoot complex interactions involving GPUs, compilers, ML frameworks, and distributed training runtimes (e.g., vLLM compilation failures, CUDA memory leaks, ROCm kernel crashes).

Must-Haves

  • 5 years of experience in HPC, GPU cluster operations, Linux systems engineering, or similar roles.
  • Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field.
  • Strong expertise with NVIDIA (H100/B200) or AMD (MI325x/MI355x) GPUs, including driver and kernel-level debugging.
  • Deep understanding of Linux internals, kernel modules, hardware bring-up, and systems performance tuning.
  • Experience with network security, including VPNs, iptables/firewalld, SSH, and identity management (LDAP/FreeIPA/AD).
  • Proficiency in Bash and Python for scripting, automation, and workflow tooling.
  • Familiarity with ML software stacks: CUDA toolkit, cuDNN, NCCL, ROCm, JAX/PyTorch runtime behavior.
  • Deep debugging experience with NVLink/NVSwitch fabrics and RDMA networking.

Nice-to-Haves

  • Experience with job schedulers such as Slurm, Kubernetes, or Run:AI.
  • Exposure to vLLM, model serving optimizations, or inference systems.
  • Hands-on experience with configuration management tools (Ansible, SaltStack, Terraform).
  • Previous experience supporting ML research teams in a startup or research-heavy environment.

Why Join Us

  • Opportunity to build frontier-scale AI infrastructure powering next-generation LLMs and multimodal models.
  • Work with top-tier engineers and researchers across systems, GPUs, and ML frameworks.
  • Tackle high-impact performance and scalability challenges in training and inference.
  • Access state-of-the-art GPU clusters, datasets, and tooling.
  • Opportunity to publish, patent, and push the boundaries of modern AI.
  • Join a culture of innovation, ownership, and fast execution in a rapidly scaling AI organization.

Benefits Include

  • Medical, dental, and vision insurance
  • 401k plan
  • Daily lunch, snacks, and beverages
  • Flexible time off
  • Competitive salary and equity

Equal opportunity

Sciforium is an equal opportunity employer. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.

Salary.com Estimation for Senior HPC & GPU Infrastructure Engineer in San Francisco, CA
$114,315 to $141,237

This job has expired.