Lead Site Reliability Engineer

Talent Vine
Washington, DC Full Time
POSTED ON 12/23/2025
AVAILABLE BEFORE 2/23/2026

About Our Client

Our client is redefining how modern defense technology is delivered. Based in Washington, D.C., they are built for the dynamic mission environment facing the Department of Defense, the Intelligence Community, and federal law enforcement agencies. They provide full-spectrum national security solutions that combine secure infrastructure, cleared talent, and mission-ready software to meet evolving defense challenges.

Their services include secure software development in classified environments and the design and implementation of advanced IT and cybersecurity capabilities—ranging from secure cloud architectures and enterprise infrastructure to data center operations, scientific analysis, and cutting-edge cyber defense.

They are led by technologists and veterans with firsthand mission experience, enabling deep understanding of both operational realities and the innovation required to succeed. Their approach is agile and outcome-based, delivering results in weeks rather than months whenever possible.

At our client, people, integrity, and excellence come first. They foster an environment where innovation thrives in support of mission-critical requirements. Team members receive competitive compensation, robust benefits, professional development and certification opportunities, and clear paths for growth while working on some of the nation’s most critical projects.

Core Values

  • Innovation & Responsiveness: Pushing beyond legacy models with efficient, tech-led solutions built to scale and evolve

  • Trusted Performance: Security, compliance, and deep experience delivering in demanding environments guide all work

  • Mission-Focused Expertise: From veteran leadership to cleared engineers, the team understands both the technology and the mission


About the Role

As the Lead Site Reliability Engineer for a major compute and AI infrastructure engagement, you will be responsible for the reliability, scalability, and performance of one of the largest hardware and AI infrastructure efforts in the U.S. defense sector.

You will lead the deployment, management, and automation of a high-performance computing mesh across multiple secure environments, ensuring operational excellence and mission continuity for a nine-figure government program.

This is a hands-on engineering leadership role that bridges physical infrastructure and modern DevOps automation—ideal for someone who thrives at the intersection of hardware systems, distributed computing, and AI/ML workflows.


What You’ll Do

  • Lead infrastructure design, deployment, and operations for large-scale hardware clusters across secure and distributed environments

  • Install and configure physical systems, including high-density GPU servers, networking gear, and storage arrays

  • Build and deploy secure Linux images and containerized workloads using OpenShift and other orchestration platforms

  • Develop and manage automation pipelines for provisioning, configuration management, and monitoring using modern DevOps toolchains (Ansible, Terraform, etc.)

  • Operate and maintain distributed networking meshes across classified and unclassified domains

  • Implement and manage out-of-band management tools (IPMI, iDRAC, BMC, etc.) for remote troubleshooting and control

  • Integrate and optimize NVIDIA GPU infrastructure for AI/ML training and inference workloads

  • Collaborate with mission engineers, software teams, and government operators to ensure system readiness and performance

  • Provide on-site technical leadership for deployments, troubleshooting, and continuous improvement

  • Mentor junior engineers and establish operational best practices as the program scales


What You’ll Bring

  • 3 years of experience in site reliability, systems engineering, or hardware operations roles

  • Deep expertise with physical infrastructure: server racking, cabling, diagnostics, and troubleshooting

  • Strong Linux systems administration experience, including imaging and automated deployment

  • Hands-on experience managing large-scale clusters or distributed systems in OpenShift or Kubernetes

  • Familiarity with DevOps automation (Ansible, Terraform, CI/CD pipelines)

  • Experience configuring and managing networking and mesh architectures

  • Direct experience with NVIDIA GPUs, CUDA, and AI/ML frameworks

  • Proficiency with out-of-band management tools (IPMI/iDRAC)

  • Certifications: Linux and Security (required or in progress)

  • Excellent communication, documentation, and problem-solving skills

  • Clearance: Active TS/SCI required


Bonus Points

  • Experience operating in secure DoD or intelligence environments

  • Familiarity with Palantir platforms or other government data systems

  • Experience supporting AI/ML infrastructure in production or tactical settings

  • Experience tuning and monitoring HPC or GPU-accelerated clusters
