ML Systems Engineer – Inference Serving at Oho Group
About the Opportunity
A genuine ground-floor opportunity at a stealth, well-funded AI hardware startup building a custom AI SoC and full inference serving stack from the ground up. You'll work shoulder-to-shoulder with world-class hardware and software engineers, with real end-to-end ownership over how foundation models run on next-generation silicon.
What You'll Do
- Serve as a core contributor on a small, senior team designing and building state-of-the-art inference serving and cluster scheduling capabilities for a custom AI SoC
- Architect high-performance multi-node inference stacks, owning throughput and latency from first principles
- Design and implement advanced optimisation strategies across hybrid tensor, pipeline, and expert parallelism (TP/PP/EP), continuous batching, and KV cache management at the intersection of compute, networking, and storage
- Drive performance improvements directly inside leading open-source inference frameworks, including vLLM, SGLang, and PyTorch
- Build advanced cluster scheduling algorithms that push efficiency boundaries for large-scale foundation model serving
- Engage directly with the open-source community, upstreaming optimisations and shaping the roadmap of widely used AI infrastructure projects
- Apply rigorous benchmarking, testing, and debugging practices to maintain a production-grade stack running on novel silicon
What We're Looking For
- Strong Python, C++, and PyTorch fundamentals with a proven track record of shipping high-quality software in fast-moving environments
- 1+ years as an active developer contributing to LLM inference serving frameworks such as vLLM or SGLang
- Deep knowledge of LLM inference internals — KV cache, batching strategies, and attention mechanisms
- Experience running and optimising large-scale workloads on heterogeneous clusters
- Strong performance analysis skills; GPU kernel development in CUDA, Triton, or ROCm is a plus
- Familiarity with networking, storage management, or distributed scheduling technologies such as Orca or LMCache is a significant plus
Education
Master's or PhD in Computer Science, Computer Engineering, Electrical Engineering, or equivalent industry experience preferred.
Salary: $250,000 - $300,000