Dice is the leading career destination for tech experts at every stage of their careers. Our client, XPath Solutions LLC, is seeking the following. Apply via Dice today!
Senior Kubernetes Platform Engineer
ML / GenAI Infrastructure | Terraform | Cloud-Native
In-Person Interview Is Non-Negotiable
Location: Charlotte, NC - On-Site/Hybrid
Employment Type: Contract-to-Hire
Experience: 7–12 Years (5 years hands-on Kubernetes)
Industry: Enterprise AI / Cloud Infrastructure
⸻
About The Role
We are looking for a Senior Kubernetes Platform Engineer to design, build, and operate mission-critical Kubernetes infrastructure that powers large-scale Machine Learning (ML) and Generative AI (GenAI) workloads.
This is not a standard Kubernetes admin role. You will act as a subject matter expert, driving architecture decisions across scheduling, networking, security, storage, and multi-tenancy. You will work closely with ML engineers, researchers, and application teams to build scalable, GPU-optimized platforms that accelerate AI innovation.
⸻
Key Responsibilities
Kubernetes Platform Engineering
- Design, deploy, and manage multi-cluster Kubernetes environments (EKS, GKE, AKS)
- Build advanced Kubernetes components including CRDs, Operators, admission webhooks, and custom schedulers
- Optimize Kubernetes for GPU workloads (NVIDIA device plugins, MIG, time-slicing)
- Implement autoscaling solutions (HPA, VPA, KEDA, Cluster Autoscaler)
- Enforce security using RBAC, OPA/Gatekeeper, and Pod Security Standards
- Manage a service mesh (Istio / Linkerd) for secure and observable microservices
- Configure networking (Cilium, Calico), ingress controllers, and network policies
- Lead cluster lifecycle management (upgrades, backups, disaster recovery)
- Package platform components using Helm and Kustomize
ML / GenAI Infrastructure
- Design ML pipelines using Kubeflow, Argo Workflows, or Ray
- Build scalable model serving platforms (KServe, Triton, TorchServe, vLLM)
- Optimize distributed compute using Ray on Kubernetes
- Design storage solutions for ML datasets and artifacts (EFS, GCS, NFS, etc.)
- Enable GPU-backed environments (JupyterHub, Kubeflow Notebooks)
- Deploy and manage vector databases for RAG applications
- Optimize LLM inference (batching, caching, multi-GPU scaling)
Infrastructure as Code (Terraform)
- Develop and maintain reusable Terraform modules for cloud infrastructure
- Implement remote state management and multi-environment workflows
- Enforce best practices: versioning, drift detection, policy-as-code
- Integrate Terraform into CI/CD pipelines and GitOps workflows
- Use tools like Atlantis or Terraform Cloud for automated deployments
Observability, Security & Reliability
- Build an observability stack (Prometheus, Grafana, Loki, Jaeger/Tempo)
- Implement audit logging and runtime security (Falco, SIEM integration)
- Define SLOs/SLIs and maintain platform reliability
- Perform GPU capacity planning and cost optimization
- Lead incident response and post-mortem analysis
Required Skills & Technologies
- Kubernetes (Expert level)
- Terraform (Advanced)
- Helm / Kustomize
- AWS / Google Cloud Platform / Azure (EKS, GKE, AKS)
- Istio / Linkerd
- Argo Workflows / Kubeflow / Ray
- KServe / Triton
- Prometheus / Grafana
- Cilium / Calico
- OPA / Gatekeeper
- NVIDIA GPU Operator
- Docker / containerd
- GitOps tools (ArgoCD / Flux)
- Python / Go / Bash
- Linux systems and networking
Required Qualifications
- At least 7 years in cloud/platform engineering
- At least 5 years hands-on Kubernetes in production
- Deep understanding of Kubernetes internals (control plane, CNI, CSI, etc.)
- Experience running GPU-based ML/AI workloads at scale
- Strong Terraform expertise (modules, CI/CD, multi-cloud)
- Experience with ML orchestration tools (Kubeflow, Argo, or Ray)
- Proficiency in at least one programming language (Python, Go, or Bash)
- Experience with GitOps and secure container practices
Preferred Qualifications
- CKA (Certified Kubernetes Administrator) — Required
- CKS (Certified Kubernetes Security Specialist) — Preferred
- CKAD certification
- Cloud DevOps certifications (AWS / Google Cloud Platform)
- Terraform certification
- Experience with Crossplane or multi-cluster management
- Familiarity with eBPF tools (Hubble, Pixie)
- Contributions to CNCF or open-source Kubernetes ecosystem
What You’ll Deliver (First 90 Days)
- Day 30: Audit existing Kubernetes clusters and deliver a gap analysis
- Day 60: Implement Terraform-managed clusters with security and observability
- Day 90: Deploy a production-ready model serving platform with SLO dashboards
Who You Are
- A systems thinker with a strong platform mindset
- Proactive and automation-driven
- Comfortable working cross-functionally with ML and engineering teams
- Influential communicator who can drive architecture decisions
- Security-focused and reliability-driven
Why Join Us
This role is ideal for engineers passionate about Kubernetes and AI infrastructure who want to build the backbone of next-generation enterprise AI platforms.