Site Reliability Engineer - Storage

xAI
Memphis, TN Full Time
POSTED ON 1/11/2026
AVAILABLE BEFORE 7/10/2026
About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

As a Site Reliability Engineer (SRE) for Storage at xAI, you will play a pivotal role in ensuring the reliability, scalability, and performance of our petabyte-to-exabyte scale storage infrastructure, including filesystems and our internal storage product supporting the Colossus superclusters in Memphis, the world's largest AI training clusters with hundreds of thousands of liquid-cooled GPUs. We're deploying multiple exabytes of storage this year across several sites to fuel Grok's training and advanced AI workloads. You will collaborate with storage engineers, software engineers, and hardware storage teams to deploy, troubleshoot, and optimize storage for 24/7 AI I/O demands such as checkpointing, dataset streaming, and long-term archival storage, and to ensure maximum uptime. This is a hands-on technical position in a dynamic environment, offering the opportunity to tackle complex challenges at the intersection of storage systems, hardware integration, and reliability engineering.

Responsibilities
  • Deploy, maintain, and scale exabyte-scale storage clusters with a focus on observability, zero-downtime upgrades, and integration with high-density GPU environments.
  • Troubleshoot production storage issues across hardware-software stacks (NVMe/PCIe/RDMA paths, firmware bugs, BMC logs, disk failures), performing root cause analysis and automating preventive measures.
  • Collaborate with storage teams to validate server specs, debug field problems, and influence custom designs with vendors for cutting-edge AI storage.
  • Evaluate and onboard new storage vendors and technologies; benchmark for cost, density, and GPU-direct performance against AI training I/O patterns.
  • Support storage SDEs by translating engineering requirements into reliable, observable systems; develop scripting and playbooks to reduce toil and enable self-service.
  • Lead hardware refreshes for legacy X storage fleets, including migration, decommissioning, and designing repeatable processes for customized solutions.
  • Participate in on-call rotations (follow-the-sun, generous stipend) for storage domains; respond to incidents, drive post-mortems, and forecast capacity for EiB growth.
  • Create and maintain documentation, standard operating procedures, and monitoring for storage health in massive-scale AI pipelines.
Required Qualifications
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience).
  • 3 years of experience in site reliability engineering, systems engineering, or storage operations at multi-PB scale.
  • Hands-on experience with storage systems from vendors such as VAST, DDN, and Dell; parallel filesystems (such as Lustre, GPFS, Weka); and Linux storage stacks (kernel tuning, eBPF, blktrace, NVMe/RDMA/RoCE).
  • Proficiency in scripting for automation (Python/Bash); light programming experience (Go nice-to-have) but emphasis on operational clarity over heavy coding.
  • Strong troubleshooting skills across storage hardware (e.g., hard drives, SSDs, NVMe drives, drive enclosures, and software/firmware) and vendor qualification/refresh cycles.
  • Experience with incident response, including on-call rotations, rapid resolution, root cause analysis, and implementation of preventative measures.
  • Basic hardware knowledge for storage bring-up and debugging in data center environments.
  • Excellent communication and documentation skills, with the ability to share knowledge concisely and accurately.

xAI is an equal opportunity employer.

Salary.com Estimation for Site Reliability Engineer - Storage in Memphis, TN
$87,658 to $103,361