What are the responsibilities and job description for the Senior Compute Platform Engineer position at GTN Technical Staffing?
Compute Platform Engineer
Location: Dallas, TX (Hybrid)
Type: Direct Hire
• Competitive base salary plus performance bonus
• 100% company-paid benefits
Overview
We are seeking a Compute Platform Engineer to support the reliability, performance, and operational health of large-scale, high-performance compute infrastructure that serves critical research and production workloads.
This role is responsible for maintaining and troubleshooting CPU- and GPU-based compute platforms, ensuring consistent performance at scale, and driving operational excellence across the environment. The position works closely with platform engineering, infrastructure, and operations teams, as well as hardware vendors, to support a stable and highly available compute ecosystem.
The ideal candidate brings strong hands-on experience with HPC or AI infrastructure, deep knowledge of server hardware, and a proactive approach to troubleshooting, automation, and continuous improvement.
Key Responsibilities
Compute Infrastructure Engineering
• Design, configure, and manage high-performance compute infrastructure composed of CPU and GPU nodes
• Support large-scale HPC and AI platforms, ensuring systems are stable, performant, and production-ready
• Perform diagnostics, tuning, and capacity planning to support efficient scale-out of compute environments
Hardware Reliability & Lifecycle Management
• Manage the full firmware and BIOS lifecycle across compute infrastructure, including baselines, validation, rollout, and compliance
• Troubleshoot complex hardware issues across CPU, GPU, DPU, NVSwitch, NICs, memory, PSU, and BMC components
• Drive root cause analysis and implement solutions to improve system reliability and reduce recovery time
• Analyze hardware lifecycle processes and recommend improvements for optimization and efficiency
Automation & Platform Operations
• Automate health checks, onboarding workflows, and operational processes to improve deployment efficiency
• Leverage Infrastructure-as-Code (IaC) methodologies to enable scalable and repeatable infrastructure management
• Recommend and implement tooling and process improvements to enhance platform operations
Vendor & Cross-Functional Collaboration
• Collaborate with hardware vendors to resolve firmware and system issues, providing detailed diagnostics, logs, and impact analysis
• Work closely with infrastructure, platform, and operations teams to align on system performance and reliability goals
• Support integration of hardware improvements across the broader environment
Monitoring, Performance & Security
• Monitor hardware performance and identify opportunities for optimization
• Implement best practices for platform security and system hardening
• Ensure adherence to operational standards and data center processes
Technical Leadership
• Act as a subject matter expert for compute infrastructure and hardware-related issues
• Mentor junior engineers and contribute to a culture of continuous improvement and technical excellence
Required Experience
• 3+ years of hands-on experience supporting large-scale compute platforms, HPC, or AI infrastructure
• Strong experience with HPE server platforms such as ProLiant and Apollo
• Experience working with NVIDIA GPUs, including A100, H100/H200, or similar
• Solid understanding of server architecture including UEFI/BIOS, PCIe devices, and out-of-band management systems (iLO, BMC)
• Proven ability to troubleshoot complex hardware issues and coordinate with vendors for resolution
• Experience with Linux in high-performance or latency-sensitive environments
• Familiarity with core networking concepts including DNS, DHCP, VLANs, switching, and routing
• Experience working within data center environments and operational processes
Technical Skills
• Experience with automation tools such as Ansible, Terraform, and CI/CD pipelines
• Exposure to Infrastructure-as-Code (IaC) practices
• Working knowledge of Kubernetes and/or OpenStack (preferred)
• Strong problem-solving and analytical skills with the ability to operate in complex environments
Preferred Experience
• Experience supporting AI platforms or next-generation GPU architectures
• Exposure to large-scale distributed compute environments
• Experience working in mission-critical or high-availability infrastructure environments
Salary: $160,000 - $175,000