What are the responsibilities and job description for the Engineering Manager, HPC Platform position at GTN Technical Staffing?

Engineering Manager, HPC Platform

Location: Dallas, TX (Hybrid)

Type: Direct Hire

• Competitive base salary plus performance bonus

• 100% company-paid benefits

• Relocation available

Overview

We are seeking an Engineering Manager, HPC Platform to lead the design, scaling, and operational excellence of a bare-metal Kubernetes platform powering HPC, AI/ML workloads, and next-generation CaaS / GPUaaS environments.

This organization operates at the forefront of high-performance computing and AI infrastructure, building platforms that support large-scale research, simulation, and production workloads. This role will lead a team responsible for delivering multi-tenant, GPU-accelerated compute platforms, enabling GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) capabilities across distributed data center environments.

This is a hands-on leadership role focused on platform performance, reliability, and automation. You will define the technical roadmap, guide system architecture, and ensure the platform delivers high-throughput, low-latency performance at scale for distributed HPC and AI workloads.

Key Responsibilities

Leadership & Team Development

• Lead, mentor, and grow a team of engineers building and scaling HPC and Kubernetes-based platform infrastructure
• Foster a culture of ownership, operational excellence, and continuous improvement
• Drive alignment across engineering, platform, and infrastructure teams

Platform Architecture & Engineering

• Architect and scale a bare-metal Kubernetes platform supporting HPC, AI/ML, and CaaS / GPUaaS workloads
• Design and optimize multi-tenant GPU and CPU environments, including workload isolation, scheduling, and resource management
• Define architecture patterns for high-performance, distributed compute platforms

GPU Platform & Workload Optimization

• Optimize GPU utilization, scheduling, and performance across large-scale clusters
• Support AI/ML training, LLM workloads, and scientific computing at scale
• Ensure efficient workload orchestration across Kubernetes and HPC scheduling environments

Automation, SRE & Platform Operations

• Drive automation using Infrastructure-as-Code (Terraform, Ansible) and CI/CD pipelines
• Implement SRE best practices for reliability, observability, and incident response
• Build scalable operational frameworks supporting large, multi-tenant compute environments

Performance, Reliability & Capacity Planning

• Own platform performance, uptime, and scalability across thousands of nodes
• Define and track KPIs for system health, utilization, and performance
• Lead capacity planning and forecasting aligned with rapid compute growth

Cross-Functional Collaboration

• Partner with research, storage, and networking teams to integrate distributed filesystems and high-speed interconnects (InfiniBand, RoCE)
• Collaborate with hardware and software vendors to improve platform capabilities and deployment efficiency
• Align platform architecture with evolving HPC, AI, and GPUaaS / CaaS delivery models

Required Experience

• 7+ years of experience in infrastructure, platform, or SRE engineering, including 2+ years in a technical leadership role
• Proven experience operating Kubernetes environments for HPC, AI/ML, or GPU-accelerated workloads
• Experience designing or supporting CaaS, GPUaaS, or multi-tenant compute platforms
• Deep expertise in Linux systems, networking, and performance optimization on bare-metal infrastructure
• Experience managing large-scale distributed clusters and integrating storage and high-performance networking
• Strong experience with automation tools (Terraform, Ansible) and observability platforms (Prometheus, Grafana, Loki)
• Strong communication and leadership skills with the ability to translate technical direction into execution

Preferred Experience

• Familiarity with HPC schedulers (Slurm, Flux) and hybrid scheduling models
• Experience with container runtimes (containerd, CRI-O) and Kubernetes internals
• Contributions to open-source Kubernetes, HPC, or ML infrastructure projects
• Experience operating in hyperscale or AI-focused infrastructure environments

Additional Requirements

This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship.
We are unable to sponsor or take over sponsorship of employment visas at this time.
