What are the responsibilities and job description for the HPC Systems Administrator, Networking & Data Center Operations position at Empire AI?

About Empire AI

Empire AI is establishing New York as the national leader in responsible artificial intelligence. It is backed by a consortium of top academic and research institutions, including Columbia University, Cornell University, NYU, CUNY, RPI, SUNY, University of Rochester, RIT, Mount Sinai, and Flatiron Institute.

By leveraging the state's rich academic resources and research institutions, Empire AI is driving innovation in fields like medicine, education, energy, and climate change, while giving New York's researchers access to computing resources that are often prohibitively expensive and available only to big tech companies. This access fuels statewide innovation, drives economic growth, and prepares a future-ready AI workforce to tackle society's most complex challenges.

The initiative is funded by $500 million in public and private investments from a State Capital Grant, academic institutions, the Simons Foundation, the Flatiron Institute, and Tom Secunda (co-founder of Bloomberg).

Position Summary

The HPC Systems Administrator, Networking & Data Center Operations will design, deploy, and maintain the high-speed network fabrics and physical data center infrastructure that underpin Empire AI's shared and distributed high-performance computing environments. Reporting to the Manager, AI/ML Systems Administration, this role is responsible for ensuring the reliability, performance, and scalability of the network and data center systems that support AI/ML workloads, large-scale simulations, and research computing across Empire AI's statewide consortium of academic and research institutions.

This role serves as the operational backbone of Empire AI's HPC infrastructure, translating architectural designs into hardened, high-availability environments while collaborating closely with systems, security, and research teams to meet the demands of cutting-edge AI workloads across a federated, multi-institutional platform.

Duties and Responsibilities

High-Performance Networking

Design, deploy, and maintain InfiniBand (HDR/NDR) and RoCEv2/Ethernet fabrics for low-latency, high-throughput HPC and AI workloads across Empire AI's distributed cluster environment
Implement and manage leaf-spine network architectures, EVPN-VXLAN overlays, and RDMA-optimized configurations across federated, multi-institutional environments
Troubleshoot network-layer performance bottlenecks, including MPI/NCCL collective traffic patterns and rail-optimized topologies for LLM and multimodal AI workloads
Perform cable plant management, optical transceiver diagnostics, and switch firmware upgrades across the data center fabric
Evaluate and recommend emerging network hardware, interconnects, and architectures to meet Empire AI's evolving AI infrastructure needs

Data Center Operations

Plan and execute hardware deployments including racking, stacking, and cabling of compute nodes, GPU servers, switches, and storage arrays
Maintain DCIM (Data Center Infrastructure Management) records for accurate asset inventory, power mapping, and capacity planning across Empire AI's infrastructure
Coordinate with facilities teams on power, cooling, and physical security for compute environments
Manage hardware lifecycle including procurement support, RMA processing, firmware/BIOS standardization, and decommissioning
Conduct routine health checks and physical inspections; respond to hardware alerts and data center incidents
Develop and enforce data center standards for cable management, labeling, and physical documentation

HPC Cluster Administration

Deploy, configure, and maintain Linux-based HPC clusters (Rocky/Ubuntu) at scale, including compute, GPU, storage, and management nodes
Administer cluster management platforms such as NVIDIA Base Command Manager (BCM) for provisioning and system lifecycle management
Ensure workload portability and compatibility across heterogeneous hardware and storage platforms

Monitoring, Automation & Observability

Build and maintain comprehensive monitoring dashboards using Prometheus and Grafana to track cluster health, GPU utilization, network throughput, and job telemetry
Develop automation for provisioning, health checks, firmware updates, and configuration management
Implement alerting and escalation procedures for network, hardware, and infrastructure incidents

Collaboration & Documentation

Consult with research teams to assess computational and network connectivity needs, and advise on workflow and data transfer optimization
Maintain clear system documentation, network diagrams, data center floor maps, configuration guides, and runbooks
Participate in special technical initiatives aligned with Empire AI's statewide AI infrastructure goals

Minimum Qualifications

Bachelor's degree in Computer Science, Engineering, Network Engineering, or a related technical field
5 years of hands-on experience in systems administration with a focus on networking and data center operations
Expertise in InfiniBand networking (subnet management, UFM, opensm) and/or RoCEv2/Ethernet HPC fabrics
Solid understanding of x86/ARM server architecture, GPU systems, and HPC interconnects
Experience with leaf-spine network architecture, EVPN-VXLAN, and RDMA-optimized configurations
Proficiency in Bash and Python scripting for systems and network automation
Experience with monitoring stacks: Prometheus, Grafana, or equivalent
Ability to perform physical data center work including racking, cabling, and hardware troubleshooting

Preferred Qualifications

Experience with NVIDIA Base Command Manager (BCM), NVIDIA UFM, or DGX SuperPOD infrastructure
Familiarity with InfiniBand, NVLink, and GPU-to-GPU communication frameworks (NCCL, MPI) for AI training and LLM workloads
Proficiency in infrastructure-as-code and configuration management (Ansible, Git)
Familiarity with DCIM tools and capacity planning for large-scale data centers
Knowledge of Science DMZ architecture and high-performance research data transfer tools (Globus)
Experience supporting HIPAA-compliant or NIST 800-171 regulated HPC environments

