What are the responsibilities and job description for the On-prem Platform Engineer position at Ampstek?
Role: On-prem Platform Engineer
Location: Charlotte, NC (Onsite)
Job Type: Long-Term Contract
Job Description:
Key Skills:
Must-Have Skills (Mandatory Keywords)
LLM Inference & Optimization
• vLLM, TensorRT-LLM, Triton Inference Server, SGLang
• Inference optimization techniques:
o Continuous batching
o Speculative decoding
o KV cache / Prefix caching
• Model optimization:
o FP8, AWQ, GPTQ
Distributed & GPU Systems
• Tensor parallelism and large model scaling
• CUDA, NCCL, GPU architecture
• GPU partitioning & optimization (MIG)
Kubernetes & ML Serving
• Kubernetes-based ML serving platforms
• KServe, OpenShift AI
• Helm charts, Operators, platform automation
GPU Orchestration
• Run:AI or similar GPU scheduling/orchestration platforms
• Multi-tenant GPU workload management
Platform Engineering
• Experience building internal AI/ML platforms (on-prem or hybrid)
• Strong automation and system design mindset
Observability & Performance
• Prometheus, Grafana
• ML observability (model latency, throughput, drift, resource utilization)
• Performance benchmarking and tuning
Good to Have / Preferred Skills:
• Experience with LLMOps / GenAI pipelines
• Exposure to hybrid cloud (on-prem integration with GCP/Azure)
• Familiarity with Inferentia / alternative accelerators
• Knowledge of service mesh / networking in GPU clusters
Responsibilities:
• Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.
• Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, SGLang, and advanced techniques (continuous batching, speculative decoding, KV caching).
• Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
• Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.
• Drive inference optimization and benchmarking, leveraging FP8, AWQ, GPTQ, and performance tools such as GuideLLM and Locust.
• Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI to ensure SLA/SLO compliance for GenAI services.
• Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.
Thanks and regards,
Deepa Maurya | Technical Recruiter - US Staffing
Email: deepa.m@ampstek.com | Desk: (609) 527-8971
Ampstek LLC – Global IT Partner | www.ampstek.com