Infrastructure Engineer, LLM Inference Optimization

GMI Cloud
Mountain View, CA Full Time
POSTED ON 6/20/2026
AVAILABLE BEFORE 7/16/2026

About Us

GMI Cloud is a fast-growing AI infrastructure company backed by Headline VC and one of only six cloud providers worldwide to earn NVIDIA's prestigious Reference Platform Cloud Partner designation. We operate 8 of our own GPU clusters across the U.S. and Asia, delivering a full spectrum of services from GPU compute to AI model inference API solutions. As an NVIDIA Reference Platform Cloud Partner, our infrastructure meets the highest standards for performance, security, and scalability in AI deployments. We empower AI startups and enterprises to "build AI without limits," providing everything they need to prototype, train, and deploy AI models quickly and reliably.


About this role

GMI Cloud is building the leading inference optimization solution and the most advanced token platform in the global token market — and we are hiring world-class Machine Learning Engineers to make GMI the new industry benchmark for LLM serving performance, cost efficiency, and production reliability.

This role is for engineers who want to work at the frontier of LLM inference systems. You will drive the research, validation, and productionization of the most advanced inference optimization techniques and turn them into a real competitive advantage over top open-source baselines such as vLLM and SGLang. Our charter is not just to adopt what's published; it is to define the recipes the rest of the industry follows, ship the optimizations, and contribute back to the community.


Our goal is to make GMI the company that leads the industry in how fast we discover, evaluate, combine, and operationalize the best optimization strategies for real customer workloads. That means not only adopting the latest advances, but also defining best practices, developing our own optimization methodologies, and building the internal framework that keeps GMI ahead of the curve.


You will focus on B200-first optimization, with support for H200 evolution, across core domains including quantization, speculative decoding, KV cache and memory management, prefill/decode disaggregation, and system-level inference optimization. You will work closely with platform and infrastructure teams to transform cutting-edge ideas into measurable gains in latency, throughput, cost efficiency, and production scalability.


Key Responsibilities

  • Design and operate a reliable, repeatable experiment framework for LLM inference (vLLM, SGLang, TensorRT-LLM, NVIDIA Dynamo, and in-house solutions) across H200 and B200 fleets.
  • Build A/B testing infrastructure against real traffic.
  • Optimize and land inference solutions for custom inference scenarios.
  • Build the foundation for agentic / automated inference optimization.
  • Drive reliability and fault recovery of advanced inference solutions on top open-source models toward the 99.99% SLA target.
  • Build elastic GPU provisioning across H200, B200, and spot machines that powers both production traffic and experimentation.
  • Diagnose and fix the long tail of distributed-inference perf bugs: NCCL stalls, network contention, kernel-level latency.
  • Collaborate with ML engineers to make sure every optimization idea can be cleanly benchmarked, A/B tested, and rolled out.
  • Engage with and contribute to the open-source community (NVIDIA Dynamo, NCCL, vLLM, SGLang, TensorRT-LLM, AA benchmarks).


Required Skills

  • Strong systems / infrastructure engineering background with deep familiarity with Linux, networking, observability, and distributed systems.
  • Proficient in Python and at least one of Go / C / Rust.
  • Hands-on experience operating GPU clusters or building infrastructure for ML training or inference in production.
  • Hands-on experience with experiment systems for ML training or optimization.
  • GPU and networking: hands-on experience with NCCL, InfiniBand or RoCE, GPUDirect RDMA, and multi-node distributed training/inference debugging.
  • Comfortable owning end-to-end reliability — from hardware health and network topology up through the runtime and the application.
  • Strong debugging instincts: able to correlate symptoms across hardware counters, network telemetry, and application logs.
  • Clear communication; comfortable working with both ML engineers and hardware/network specialists.


Preferred Qualifications

  • 5 years of experience in ML infrastructure or building production ML platforms.
  • Experience benchmarking or productionizing modern serving stacks: vLLM, SGLang, TensorRT-LLM, NVIDIA Dynamo, Triton.
  • Experience with H100 / H200 and especially B200 / Blackwell deployments.
  • Familiarity with PD disaggregation, KV cache offload, MoE serving topologies, or speculative decoding pipelines from an infrastructure standpoint.
  • Track record of contributing to open-source infra or benchmarking projects.
  • Experience publishing technical blogs or case studies from production infrastructure work.

Salary.com Estimation for Infrastructure Engineer, LLM Inference Optimization in Mountain View, CA
$128,414 to $172,630