What are the responsibilities and job description for the Senior Kubernetes Engineer - HPC / GPU position at GTN Technical Staffing?

Senior Kubernetes Engineer (GPU / AI Platforms)

Location: Dallas, TX (Hybrid)

Type: Direct Hire

• Competitive base salary plus performance bonus

• 100% company-paid benefits

• Relocation available

Overview

We are seeking a Senior Kubernetes Engineer to design, build, and optimize GPU-accelerated container platforms that support large-scale HPC and AI/ML workloads and next-generation Container-as-a-Service (CaaS) and GPU-as-a-Service (GPUaaS) environments.

This role focuses on enabling scalable, multi-tenant compute platforms that power CaaS and GPUaaS offerings across hybrid and on-prem infrastructure. You will work at the intersection of Kubernetes and the NVIDIA ecosystem, driving performance, efficiency, and reliability for high-throughput, GPU-intensive workloads.

The ideal candidate brings deep hands-on experience building production-grade Kubernetes platforms for AI and HPC workloads, along with strong development skills and a passion for high-performance, distributed systems at scale.

Key Responsibilities

Kubernetes Platform Engineering

Architect, deploy, and operate Kubernetes clusters optimized for GPU-intensive and multi-tenant workloads
Design platforms supporting CaaS / GPUaaS delivery models, ensuring scalability, resilience, and performance
Leverage NVIDIA GPU Operator, Network Operator, and DCGM for cluster management and observability

GPU Enablement & Scheduling

Integrate NVIDIA device plugins, MIG, and GPU sharing capabilities into Kubernetes scheduling frameworks
Optimize GPU utilization and workload placement using kube-scheduler plugins and batch schedulers such as Slurm and Volcano
Support AI/ML training, LLM workloads, and scientific computing at scale

Automation & Platform Development

Develop and maintain Kubernetes operators and custom controllers
Automate platform provisioning and lifecycle management using Go or Python
Implement Infrastructure-as-Code using Terraform, Helm, and Kustomize

Observability & Performance Optimization

Implement monitoring and telemetry using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry
Drive performance tuning, capacity planning, and optimization across GPU-enabled clusters
Support incident response and ensure production readiness

Security & Multi-Tenancy

Design secure, multi-tenant Kubernetes environments using RBAC, namespaces, and policy enforcement (OPA, Gatekeeper)
Ensure workload isolation and governance across shared GPU infrastructure
Support secure platform operations across CaaS / GPUaaS environments

DevOps & CI/CD

Build and maintain CI/CD pipelines using GitOps tools such as ArgoCD and FluxCD
Support continuous delivery and lifecycle management of Kubernetes-based platforms

Cross-Functional Collaboration

Partner with HPC, AI/ML, DevOps, and platform engineering teams to support high-performance workloads
Collaborate on platform architecture, optimization strategies, and operational best practices

Required Experience

Extensive experience operating Kubernetes in production-scale environments
Strong experience supporting HPC, AI/ML, or GPU-accelerated infrastructure
Experience designing or supporting CaaS, GPUaaS, or multi-tenant platform environments
Deep expertise with NVIDIA and Kubernetes ecosystems including GPU Operator, device plugins, NVML, MIG, and DCGM
Strong understanding of Kubernetes internals (CRDs, RBAC, controllers, scheduler extensions)
Proficiency in Go or Python for automation and operator development
Experience supporting GPU-intensive workloads (LLMs, AI/ML pipelines, HPC applications)
Hands-on experience with Helm, Kustomize, and GitOps workflows

Technical Skills

Monitoring and observability: Prometheus, Grafana, DCGM Exporter, OpenTelemetry
Networking: CNI plugins (NVIDIA CNI, Multus), service networking, cluster networking concepts
Infrastructure-as-Code: Terraform, Helm, Kustomize
CI/CD and GitOps practices

Preferred Experience

Experience with container runtimes (containerd, CRI-O, NVIDIA Container Toolkit)
Exposure to advanced networking solutions such as Cilium
Contributions to open-source projects within Kubernetes or NVIDIA ecosystems
Experience working in large-scale HPC or AI infrastructure environments

Additional Requirements

This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship.
We are unable to sponsor or take over sponsorship of employment visas at this time.

Salary: $150,000 - $230,000
