What are the responsibilities and job description for the Bare Metal Support Engineer, Mid-Level position at Jobright.ai?
Jobright is an AI-powered career platform that helps job seekers discover the top opportunities in the US. We are NOT a staffing agency. Jobright does not hire directly for these positions. We connect you with verified openings from employers you can trust.
Job Summary:
CoreWeave is the AI Hyperscaler™, delivering cutting-edge cloud services for accelerated computing. They are seeking a Bare Metal Support Engineer to support, operate, and maintain their extensive GPU fleet, ensuring reliability and performance while collaborating with customers and engineering teams.
Responsibilities:
• Provide high-level support for customers utilizing bare-metal GPU fleets on CoreWeave Cloud.
• Diagnose, triage, and investigate reported customer issues and high-priority incidents, identifying root causes and escalating when necessary.
• Develop a deep understanding of customer workloads and use cases to provide tailored technical support.
• Coordinate remote troubleshooting and hardware interventions with Data Center Technicians.
• Create and maintain internal documentation, including troubleshooting guides, best practices, and knowledge base articles.
• Participate in an on-call rotation to support production clusters and ensure operational reliability.
• Collaborate with engineering teams to improve hardware reliability, software stability, and system performance.
• Implement automation and scripting to streamline support workflows and reduce manual interventions.
• Perform in-depth log analysis and debugging across multiple layers of the stack (firmware, drivers, hardware).
• Provide feedback to internal teams on common support issues to drive continuous improvements.
• Work with networking teams to troubleshoot connectivity issues affecting customer workloads.
• Support supercomputing infrastructure running GPU workloads at scale.
• Drive operational excellence by refining internal processes and support methodologies.
Qualifications:
Required:
• Experience in data centers, GPU clusters, server deployments, system administration, or hardware troubleshooting.
• Demonstrated experience driving resolutions and continuous improvements across cross-functional teams in a data center environment.
• Intermediate knowledge of Linux (Ubuntu, CentOS, or similar), including command-line proficiency.
• Experience with NVIDIA GPUs, SuperMicro systems, Dell systems, high-performance computing (HPC), and large-scale data center environments.
• Experience with networking fundamentals (TCP/IP, VLANs, DNS, DHCP) and network troubleshooting tools.
• Hands-on experience with firmware updates, BIOS configurations, and driver management.
• Experience analyzing system logs and debugging issues across firmware, drivers, and hardware layers.
• Experience working with Jira, Confluence, Notion, or other issue-tracking and documentation platforms.
• Experience in scripting and automation (Python, Bash, Ansible, or similar).
Preferred:
• You’re curious about Kubernetes, Docker, and containerized infrastructure.
• You have strong problem-solving skills with a proactive and analytical mindset.
• You have excellent communication skills and a demonstrated ability to work collaboratively in a fast-paced environment.
Company:
CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads. Founded in 2017, the company is headquartered in Livingston, New Jersey, USA, with a team of 501-1000 employees. It is currently a public company.