What are the responsibilities and job description for the Senior GPU Platform Engineer - AI Infrastructure Operations position at WB Solutions LLC?
Join our team to operate and support cutting-edge GPU infrastructure powering AI and high-performance computing workloads for a leading global hyperscale cloud provider. In this hands-on role, you'll manage the full lifecycle of NVIDIA GPU platforms, from bring-up to break/fix, while ensuring optimal performance for advanced AI applications.
Title: Senior GPU Platform Engineer - AI Infrastructure Operations
Contract length: 1 year
Location: Redmond, WA (onsite 4 days a week)
MUST HAVE SKILLS:
- Configuration Management
- GPGPU/GPU
- Hardware Troubleshooting
- Infrastructure & Operations
- Infrastructure Automation and Orchestration
- Linux Administration
Responsibilities:
- Operate and maintain production GPU and bare-metal compute platforms with hands-on hardware management
- Perform physical infrastructure tasks including rack/stack, cabling, power validation, and system bring-up
- Diagnose hardware faults, replace failed components, and coordinate vendor support for complex issues
- Install and configure Linux operating systems with GPU-specific drivers and software stacks
- Execute platform validation using diagnostic tools to ensure GPU health, stability, and performance
- Provision bare-metal systems through automated workflows while troubleshooting configuration issues
- Apply firmware, BIOS, and platform configuration changes following standardized change processes
Requirements:
- 5 years of professional experience supporting production server infrastructure in data center environments
- Strong Linux administration skills, with the ability to independently troubleshoot system-level issues
- Hands-on experience with physical server hardware including diagnostics and component replacement
- Familiarity with GPU platforms, preferably NVIDIA, and associated drivers and software stacks
- Experience working in structured, change-controlled production environments
- Knowledge of infrastructure monitoring tools and alert response procedures
- Excellent communication skills, with the ability to collaborate across operations and engineering teams