
HPC Systems Administrator, Networking & Data Center Operations

Empire AI
Buffalo, NY Full Time
POSTED ON 5/16/2026
AVAILABLE BEFORE 8/21/2026

About Empire AI

Empire AI is establishing New York as the national leader in responsible artificial intelligence. It is backed by a consortium of top academic and research institutions including Columbia University, Cornell University, NYU, CUNY, RPI, SUNY, University of Rochester, RIT, Mount Sinai, and Flatiron Institute.

By leveraging the state's rich academic resources and research institutions, Empire AI is driving innovation in fields like medicine, education, energy, and climate change. It also gives New York's researchers access to computing resources that are often prohibitively expensive and available only to big tech companies, fueling statewide innovation, driving economic growth, and preparing a future-ready AI workforce to tackle society's most complex challenges.

The initiative is funded by $500 million in public and private investments, including a State Capital Grant and contributions from academic institutions, the Simons Foundation, the Flatiron Institute, and Tom Secunda (co-founder of Bloomberg).


Position Summary

The HPC Systems Administrator, Networking & Data Center Operations will design, deploy, and maintain the high-speed network fabrics and physical data center infrastructure that underpin Empire AI's shared and distributed high-performance computing environments. Reporting to the Manager, AI/ML Systems Administration, this role is responsible for ensuring the reliability, performance, and scalability of the network and data center systems that support AI/ML workloads, large-scale simulations, and research computing across Empire AI's statewide consortium of academic and research institutions.

This role serves as the operational backbone of Empire AI's HPC infrastructure, translating architectural designs into hardened, high-availability environments while collaborating closely with systems, security, and research teams to meet the demands of cutting-edge AI workloads across a federated, multi-institutional platform.


Duties and Responsibilities

High-Performance Networking

  • Design, deploy, and maintain InfiniBand (HDR/NDR) and RoCEv2/Ethernet fabrics for low-latency, high-throughput HPC and AI workloads across Empire AI's distributed cluster environment
  • Implement and manage leaf-spine network architectures, EVPN-VXLAN overlays, and RDMA-optimized configurations across federated, multi-institutional environments
  • Troubleshoot network-layer performance bottlenecks, including MPI/NCCL collective traffic patterns and rail-optimized topologies for LLM and multimodal AI workloads
  • Perform cable plant management, optical transceiver diagnostics, and switch firmware upgrades across the data center fabric
  • Evaluate and recommend emerging network hardware, interconnects, and architectures to meet Empire AI's evolving AI infrastructure needs


Data Center Operations

  • Plan and execute hardware deployments including racking, stacking, and cabling of compute nodes, GPU servers, switches, and storage arrays
  • Maintain DCIM (Data Center Infrastructure Management) records for accurate asset inventory, power mapping, and capacity planning across Empire AI's infrastructure
  • Coordinate with facilities teams on power, cooling, and physical security for compute environments
  • Manage hardware lifecycle including procurement support, RMA processing, firmware/BIOS standardization, and decommissioning
  • Conduct routine health checks and physical inspections; respond to hardware alerts and data center incidents
  • Develop and enforce data center standards for cable management, labeling, and physical documentation


HPC Cluster Administration

  • Deploy, configure, and maintain Linux-based HPC clusters (Rocky/Ubuntu) at scale, including compute, GPU, storage, and management nodes
  • Administer cluster management platforms such as NVIDIA Base Command Manager (BCM) for provisioning and system lifecycle management
  • Ensure workload portability and compatibility across heterogeneous hardware and storage platforms


Monitoring, Automation & Observability

  • Build and maintain comprehensive monitoring dashboards using Prometheus and Grafana to track cluster health, GPU utilization, network throughput, and job telemetry
  • Develop automation for provisioning, health checks, firmware updates, and configuration management
  • Implement alerting and escalation procedures for network, hardware, and infrastructure incidents


Collaboration & Documentation

  • Consult with research teams to assess computational and network connectivity needs, and advise on workflow and data transfer optimization
  • Maintain clear system documentation, network diagrams, data center floor maps, configuration guides, and runbooks
  • Participate in special technical initiatives aligned with Empire AI's statewide AI infrastructure goals


Minimum Qualifications

  • Bachelor's degree in Computer Science, Engineering, Network Engineering, or a related technical field
  • 5 years of hands-on experience in systems administration with a focus on networking and data center operations
  • Expertise in InfiniBand networking (subnet management, UFM, OpenSM) and/or RoCEv2/Ethernet HPC fabrics
  • Solid understanding of x86/ARM server architecture, GPU systems, and HPC interconnects
  • Experience with leaf-spine network architecture, EVPN-VXLAN, and RDMA-optimized configurations
  • Proficiency in Bash and Python scripting for systems and network automation
  • Experience with monitoring stacks: Prometheus, Grafana, or equivalent
  • Ability to perform physical data center work including racking, cabling, and hardware troubleshooting


Preferred Qualifications

  • Experience with NVIDIA Base Command Manager (BCM), NVIDIA UFM, or DGX SuperPOD infrastructure
  • Familiarity with InfiniBand, NVLink, and GPU-to-GPU communication frameworks (NCCL, MPI) for AI training and LLM workloads
  • Proficiency in infrastructure-as-code and configuration management (Ansible, Git)
  • Familiarity with DCIM tools and capacity planning for large-scale data centers
  • Knowledge of Science DMZ architecture and high-performance research data transfer tools (Globus)
  • Experience supporting HIPAA-compliant or NIST 800-171 regulated HPC environments

Salary.com Estimation for HPC Systems Administrator, Networking & Data Center Operations in Buffalo, NY
$82,781 to $104,965


