What are the responsibilities and job description for the Machine Learning Engineer position at Oho Group?
Senior/Lead Inference Systems Engineer (ML/AI)
About the Role
We're looking for a Senior or Lead Inference Systems Engineer to help build the next generation of large-scale AI inference infrastructure. Working alongside world-class hardware and software engineers, you'll play a key role in developing high-performance inference serving systems and cluster scheduling technologies that maximise the efficiency of modern foundation models.
This is an exciting opportunity to work at the cutting edge of distributed AI systems, helping shape how large language models are deployed, scaled, and optimised across heterogeneous compute environments.
Key Responsibilities
- You will help design, build, and optimise large-scale inference serving platforms capable of delivering industry-leading throughput, latency, and efficiency.
- You'll develop and refine multi-node inference strategies that maximise performance across distributed compute clusters.
- You will work on advanced optimisation techniques including tensor parallelism, pipeline parallelism, expert parallelism, continuous batching, and KV cache management.
- You will collaborate with hardware and systems teams to optimise workloads across compute, networking, and storage infrastructure.
- You'll be responsible for driving performance improvements across leading inference frameworks such as vLLM, SGLang, and PyTorch.
- You will contribute to the design and implementation of cluster scheduling systems that intelligently allocate resources and maximise utilisation at scale.
- You'll engage with the open-source community, contributing optimisations upstream and helping influence the future direction of widely adopted AI infrastructure projects.
- You will help establish best practices around benchmarking, testing, debugging, and performance analysis to ensure a highly reliable production-grade platform.
Required Qualifications
- You'll need strong software engineering experience with Python, C++, and PyTorch.
- You should have experience developing or contributing to modern LLM inference serving frameworks such as vLLM, SGLang, or equivalent technologies.
- You must possess a deep understanding of large language model inference, including attention mechanisms, batching strategies, KV cache management, and serving optimisation techniques.
- You'll need hands-on experience deploying, operating, or optimising large-scale distributed workloads across multi-node compute environments.
- Experience with performance profiling, benchmarking, debugging, and system-level optimisation is essential.
- You should be comfortable working in fast-paced engineering environments and collaborating across multiple technical disciplines.
Preferred Qualifications
- Experience with distributed scheduling systems, cluster orchestration, resource management, or workload optimisation technologies.
- Exposure to networking, storage systems, distributed caching, or infrastructure platforms supporting large-scale AI deployments.
- Experience working with technologies such as Orca, LMCache, or similar distributed inference optimisation frameworks.
- Knowledge of GPU programming and performance optimisation using CUDA, Triton, ROCm, or related technologies.
- Understanding of AI accelerator architectures and large-scale heterogeneous computing environments.
- Experience contributing to open-source AI infrastructure projects.
Education
You should be educated to Master's or PhD level in Computer Science, Computer Engineering, Electrical Engineering, or a related technical discipline, or possess equivalent industry experience.
Salary: $250,000 - $350,000