What are the responsibilities and job description for the Senior Kubernetes Engineer - HPC / GPU position at GTN Technical Staffing?
Senior Kubernetes Engineer (GPU / AI Platforms)
Location: Dallas, TX (Hybrid)
Type: Direct Hire
• Competitive base salary plus performance bonus
• 100% company-paid benefits
• Relocation available
Overview
We are seeking a Senior Kubernetes Engineer to design, build, and optimize GPU-accelerated container platforms supporting large-scale HPC, AI/ML workloads, and next-generation CaaS / GPUaaS environments.
This role is focused on enabling scalable, multi-tenant compute platforms that power GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) offerings across hybrid and on-prem infrastructure. You will work at the intersection of Kubernetes and the NVIDIA ecosystem, driving performance, efficiency, and reliability for high-throughput, GPU-intensive workloads.
The ideal candidate brings deep hands-on experience building production-grade Kubernetes platforms for AI and HPC workloads, along with strong development skills and a passion for high-performance, distributed systems at scale.
Key Responsibilities
Kubernetes Platform Engineering
- Architect, deploy, and operate Kubernetes clusters optimized for GPU-intensive and multi-tenant workloads
- Design platforms supporting CaaS / GPUaaS delivery models, ensuring scalability, resilience, and performance
- Leverage NVIDIA GPU Operator, Network Operator, and DCGM for cluster management and observability
GPU Enablement & Scheduling
- Integrate NVIDIA device plugins, MIG, and GPU sharing capabilities into Kubernetes scheduling frameworks
- Optimize GPU utilization and workload placement using kube-scheduler plugins and batch schedulers such as Volcano and Slurm
- Support AI/ML training, LLM workloads, and scientific computing at scale
Automation & Platform Development
- Develop and maintain Kubernetes operators and custom controllers
- Automate platform provisioning and lifecycle management using Go or Python
- Implement Infrastructure-as-Code using Terraform, Helm, and Kustomize
Observability & Performance Optimization
- Implement monitoring and telemetry using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry
- Drive performance tuning, capacity planning, and optimization across GPU-enabled clusters
- Support incident response and ensure production readiness
Security & Multi-Tenancy
- Design secure, multi-tenant Kubernetes environments using RBAC, namespaces, and policy enforcement (OPA, Gatekeeper)
- Ensure workload isolation and governance across shared GPU infrastructure
- Support secure platform operations across CaaS / GPUaaS environments
DevOps & CI/CD
- Build and maintain CI/CD pipelines using GitOps tools such as ArgoCD and FluxCD
- Support continuous delivery and lifecycle management of Kubernetes-based platforms
Cross-Functional Collaboration
- Partner with HPC, AI/ML, DevOps, and platform engineering teams to support high-performance workloads
- Collaborate on platform architecture, optimization strategies, and operational best practices
Required Experience
- Extensive experience operating Kubernetes in production-scale environments
- Strong experience supporting HPC, AI/ML, or GPU-accelerated infrastructure
- Experience designing or supporting CaaS, GPUaaS, or multi-tenant platform environments
- Deep expertise with the NVIDIA and Kubernetes ecosystems, including GPU Operator, device plugins, NVML, MIG, and DCGM
- Strong understanding of Kubernetes internals (CRDs, RBAC, controllers, scheduler extensions)
- Proficiency in Go or Python for automation and operator development
- Experience supporting GPU-intensive workloads (LLMs, AI/ML pipelines, HPC applications)
- Hands-on experience with Helm, Kustomize, and GitOps workflows
Technical Skills
- Monitoring and observability: Prometheus, Grafana, DCGM Exporter, OpenTelemetry
- Networking: CNI plugins (NVIDIA CNI, Multus), service networking, cluster networking concepts
- Infrastructure-as-Code: Terraform, Helm, Kustomize
- CI/CD and GitOps practices
Preferred Experience
- Experience with container runtimes (containerd, CRI-O, NVIDIA Container Toolkit)
- Exposure to advanced networking solutions such as Cilium
- Contributions to open-source projects within Kubernetes or NVIDIA ecosystems
- Experience working in large-scale HPC or AI infrastructure environments
Additional Requirements
- This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship.
- We are unable to sponsor or take over sponsorship of employment visas at this time.
Salary: $150,000 - $230,000