What are the responsibilities and job description for the Engineering Manager, HPC Platform position at GTN Technical Staffing?
Engineering Manager, HPC Platform
Location: Dallas, TX (Hybrid)
Type: Direct Hire
• Competitive base salary plus performance bonus
• 100% company-paid benefits
• Relocation available
Overview
We are seeking an Engineering Manager, HPC Platform to lead the design, scaling, and operational excellence of a bare-metal Kubernetes platform powering HPC, AI/ML workloads, and next-generation CaaS / GPUaaS environments.
This organization operates at the forefront of high-performance computing and AI infrastructure, building platforms that support large-scale research, simulation, and production workloads. This role will lead a team responsible for delivering multi-tenant, GPU-accelerated compute platforms, enabling GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) capabilities across distributed data center environments.
This is a hands-on leadership role focused on platform performance, reliability, and automation. You will define the technical roadmap, guide system architecture, and ensure the platform delivers high-throughput, low-latency performance at scale for distributed HPC and AI workloads.
Key Responsibilities
Leadership & Team Development
- Lead, mentor, and grow a team of engineers building and scaling HPC and Kubernetes-based platform infrastructure
- Foster a culture of ownership, operational excellence, and continuous improvement
- Drive alignment across engineering, platform, and infrastructure teams
Platform Architecture & Engineering
- Architect and scale a bare-metal Kubernetes platform supporting HPC, AI/ML, and CaaS / GPUaaS workloads
- Design and optimize multi-tenant GPU and CPU environments, including workload isolation, scheduling, and resource management
- Define architecture patterns for high-performance, distributed compute platforms
GPU Platform & Workload Optimization
- Optimize GPU utilization, scheduling, and performance across large-scale clusters
- Support AI/ML training, LLM workloads, and scientific computing at scale
- Ensure efficient workload orchestration across Kubernetes and HPC scheduling environments
Automation, SRE & Platform Operations
- Drive automation using Infrastructure-as-Code (Terraform, Ansible) and CI/CD pipelines
- Implement SRE best practices for reliability, observability, and incident response
- Build scalable operational frameworks supporting large, multi-tenant compute environments
Performance, Reliability & Capacity Planning
- Own platform performance, uptime, and scalability across thousands of nodes
- Define and track KPIs for system health, utilization, and performance
- Lead capacity planning and forecasting aligned with rapid compute growth
Cross-Functional Collaboration
- Partner with research, storage, and networking teams to integrate distributed filesystems and high-speed interconnects (InfiniBand, RoCE)
- Collaborate with hardware and software vendors to improve platform capabilities and deployment efficiency
- Align platform architecture with evolving HPC, AI, and GPUaaS / CaaS delivery models
Required Experience
- 7+ years of experience in infrastructure, platform, or SRE engineering, including 2+ years in a technical leadership role
- Proven experience operating Kubernetes environments for HPC, AI/ML, or GPU-accelerated workloads
- Experience designing or supporting CaaS, GPUaaS, or multi-tenant compute platforms
- Deep expertise in Linux systems, networking, and performance optimization on bare-metal infrastructure
- Experience managing large-scale distributed clusters and integrating storage and high-performance networking
- Strong experience with automation tools (Terraform, Ansible) and observability platforms (Prometheus, Grafana, Loki)
- Strong communication and leadership skills with the ability to translate technical direction into execution
Preferred Experience
- Familiarity with HPC schedulers (Slurm, Flux) and hybrid scheduling models
- Experience with container runtimes (containerd, CRI-O) and Kubernetes internals
- Contributions to open-source Kubernetes, HPC, or ML infrastructure projects
- Experience operating in hyperscale or AI-focused infrastructure environments
Additional Requirements
- This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship.
- We are unable to sponsor or take over sponsorship of employment visas at this time.