What are the responsibilities and job description for the PowerEdge XE Server position at ITCAPS LLC?
Job Title: PowerEdge XE Server
- PowerEdge rack/tower experience, RHEL and Ubuntu OS, and experience working in a large server environment.
- PowerEdge XE server experience: PowerEdge XE-series GPU servers and accelerated/HPC environments.
- Strong experience with HPC operations in production environments (scheduling, monitoring, troubleshooting, capacity management).
- Hands-on expertise with NVIDIA GPU infrastructure (preferably GB200/GB300-class racks, H100/B100, or similar generations), including firmware/driver stack, monitoring, and lifecycle management.
- Familiarity with liquid-cooled data center infrastructure: working safely around cold-plate and rear-door heat exchanger systems, and understanding facility interfaces (CDU, manifolds, leak detection, etc.).
- Solid understanding of Linux administration (RHEL/CentOS/Ubuntu) in HPC/AI clusters: OS provisioning, patching, performance tuning, and troubleshooting.
- Strong Day 2 operations mindset: incident response, root-cause analysis, change management, and operational runbook creation.
- Excellent customer-facing communication skills and the ability to work on site, embedded with the customer's team and their operations staff.
Day 2 operations & stability
- Own day-to-day operational support for the initial 48 fully loaded GB300 racks, scaling toward 144 racks by year end.
- Monitor system health, performance, and capacity; proactively identify and remediate issues impacting uptime or SLAs.
- Perform incident triage, troubleshooting, and coordination with Dell and NVIDIA support as needed.
- Support day-to-day data center activities.
- Proactively walk the data center throughout the day, watching for amber lights, hot doors, etc., and alerting the customer.
- Escort dispatched field engineers (when applicable)
- LOIS parts management (will be trained once onsite)
- Maintain documentation of the environment: serial numbers, elevations, runbooks, etc.
- Coordinate with customer for any maintenance activities: power maintenance, cooling maintenance, containment wall maintenance, etc.
- Perform BIOS lifecycle management, user additions, and patching.
- Triage with Dell support to schedule FSE break/fix activities.
- Work at the direction of the customer team to update versions as needed.
- Resource must be onsite 5 days a week; there are no exceptions.
- Training is important: the selected candidate will need to complete XE Server training before starting the residency.
Infrastructure management
- Support rack level and node level lifecycle tasks: firmware/driver updates, BIOS tuning, OS patching, and configuration consistency.
- Assist with liquid cooling operations: safe work practices, coordination with facilities for maintenance/change activity, and monitoring of cooling performance and alarms.
Salary: $70 - $80