What are the responsibilities and job description for the ML Systems Engineer – Inference Serving position at Oho Group?

About the Opportunity

A genuine ground-floor opportunity at a stealth, well-funded AI hardware startup building a custom AI SoC and full inference serving stack from the ground up. You'll work shoulder-to-shoulder with world-class hardware and software engineers, with real end-to-end ownership over how foundation models run on next-generation silicon.

What You'll Do

Be a core contributor on a small, senior team designing and building state-of-the-art inference serving and cluster scheduling capabilities for a custom AI SoC
Architect high-performance multi-node inference stacks, owning throughput and latency from first principles
Design and implement advanced optimisation strategies across hybrid tensor, pipeline, and expert parallelism (TP/PP/EP), continuous batching, and KV cache management at the intersection of compute, networking, and storage
Drive performance improvements directly inside leading open-source inference frameworks, including vLLM, SGLang, and PyTorch
Build advanced cluster scheduling algorithms that push efficiency boundaries for large-scale foundation model serving
Engage directly with the open-source community, upstreaming optimisations and shaping the roadmap of widely used AI infrastructure projects
Apply rigorous benchmarking, testing, and debugging practices to maintain a production-grade stack running on novel silicon

What We're Looking For

Strong Python, C++, and PyTorch fundamentals with a proven track record of shipping high-quality software in fast-moving environments
1+ years as an active developer contributing to LLM inference serving frameworks such as vLLM or SGLang
Deep knowledge of LLM inference internals — KV cache, batching strategies, and attention mechanisms
Experience running and optimising large-scale workloads on heterogeneous clusters
Strong performance analysis skills; GPU kernel development in CUDA, Triton, or ROCm is a plus
Familiarity with networking, storage management, or distributed scheduling technologies such as Orca or LMCache is a significant plus

Education

Master's or PhD in Computer Science, Computer Engineering, Electrical Engineering, or equivalent industry experience preferred.

Salary: $250,000 - $300,000
