What are the responsibilities and job description for the Lead MLOps / AI Platform Engineer position at SATCON Inc?
Job Description: Lead MLOps / AI Platform Engineer
Location: Charlotte, NC
Duration: Long Term
Visa Type: & Candidates
Role Overview
We are seeking a highly skilled Lead MLOps / AI Platform Engineer to design, build, and optimize our next-generation Generative AI and Large Language Model (LLM) infrastructure. This role is pivotal in bridging the gap between cutting-edge AI research and robust production deployment. You will be responsible for orchestrating high-performance GPU environments (specifically leveraging Nvidia H200s), optimizing LLM inference, and maintaining enterprise-grade infrastructure across both Cloud (Google Cloud Platform/Azure) and On-Premise environments.
Key Responsibilities
AI Inference Optimization & Serving:
- Deploy, scale, and manage large-scale language models using advanced inference frameworks such as vLLM, TensorRT-LLM, SGLang, and Triton Inference Server.
- Implement and fine-tune performance optimization strategies, including Continuous Batching and advanced Parallelism techniques.
- Conduct load testing, benchmarking, and profiling of LLM deployments using GuideLLM and Locust to ensure optimal latency and throughput.
Cloud & Infrastructure Orchestration:
- Architect and maintain scalable, secure infrastructure on Google Cloud Platform and Azure using Infrastructure as Code (Terraform).
- Design and execute Cloud Networking, Landing Zones, and Organization Policies/Governance.
- Manage secrets and secure workloads utilizing HashiCorp Vault.
- Develop and champion Internal Developer Portals to streamline workflows for data science and product teams.
On-Premise & Kubernetes Engineering:
- Orchestrate ML workloads on Kubernetes, utilizing KServe, OpenShift AI / OpenShift Functions, and Helm charts/Operators.
- Manage compute clusters with a heavy focus on advanced GPU Orchestration (Nvidia H200 ecosystems).
- Demonstrate deep hands-on architectural expertise, including the know-how to "unfold" an LLM onto highly constrained or custom on-premise hardware setups.
Observability & SRE:
- Implement end-to-end ML Observability and monitoring frameworks using Arize AI.
- Establish Site Reliability Engineering (SRE) best practices, maintaining strict SLOs/SLIs for model deployment pipelines and production APIs.
Required Skills & Qualifications
Core AI / MLOps Stack:
- Inference Engines: vLLM, TensorRT-LLM, Triton Inference Server, SGLang
- ML Frameworks/Ops: KServe, OpenShift AI, Arize AI, GenAI Platforms, RAG architecture
- Performance & Testing: GuideLLM, Locust, Continuous Batching, Parallelism optimization
Infrastructure & Cloud Stack:
- Cloud Providers: Google Cloud Platform (GCP), Microsoft Azure
- Containerization & Orchestration: Kubernetes, OpenShift, Helm/Operators, GPU Orchestration
- IaC & Automation: Terraform, Python
- Security & Networking: HashiCorp Vault, Landing Zones, Org Policy & Governance
Hardware Sanity Check:
- Mandatory Experience: Direct, hands-on experience working with Nvidia H200 GPUs and optimizing workloads specifically for this architecture.
Salary: $60 - $70