
Manager of Reliability Operations

Liquid Web
Southfield, MI Full Time
POSTED ON 4/8/2026
AVAILABLE BEFORE 5/6/2026
About Nexcess

Nexcess brings together a portfolio of hosting, cloud, and digital experience brands to deliver high-performance infrastructure and services to businesses worldwide.

Our platforms power mission-critical applications for thousands of customers. Reliability is foundational to everything we do. We operate complex environments spanning virtualization, storage, networking, and application hosting, where performance, availability, and consistency matter at scale.

This is a permanent, full-time, remote position.

US Pay Band: $110K - $150K. Actual compensation will vary based on experience, skills, and location.

About The Role

We’re looking for a Manager of Reliability Operations to lead how we detect, respond to, and learn from failures across our platform ecosystem.

This role sits at the intersection of Operations and Engineering, bringing structure to incident response, accountability to follow-through, and clarity to reliability insights. You’ll ensure that what we learn from production directly improves how our platforms are built, operated, and scaled.

What You’ll Do

Own Reliability Operations & Incident Command

  • Continuously evolve and improve incident management, change management, and post-incident practices
  • Establish clear standards for incident declaration, severity, escalation, and communication
  • Ensure consistent execution across teams and continuous process improvement
  • Own the incident command function, including roles, structure, and operating procedures
  • Lead or oversee major incident response in a 24/7 production environment
  • Build and manage on-call incident commander rotations with global coverage

Drive Learning, Accountability & Reliability Strategy

  • Own post-incident reviews, ensuring strong root cause analysis and clear documentation
  • Translate incident trends into actionable reliability improvements
  • Drive completion of corrective actions across teams; escalate when needed
  • Define and maintain service performance and reliability targets (availability, latency, error rates)
  • Own observability strategy, including monitoring, alerting, and signal quality
  • Improve detection, reduce time to resolution, and increase platform resilience
  • Partner with Engineering and Operations on capacity planning, patching, and lifecycle decisions
  • Ensure reliability insights directly inform platform and infrastructure roadmaps
  • Collaborate with Security on vulnerability response, patch prioritization, and compliance alignment

Operate Across a Complex Platform Environment

  • Work across environments including virtualization platforms (VMware), distributed storage (Ceph), Linux-based systems, and hybrid cloud infrastructure
  • Support platforms that span dedicated hosting, managed applications, and high-availability cloud services
  • Ensure reliability practices scale across multiple products, brands, and customer environments
  • Provide regular, data-driven reporting to leadership on availability, incident trends, and operational performance
  • Act as the central authority on reliability insights across teams

What You Bring

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
  • 7 years of experience in systems operations, site reliability, or platform engineering
  • 2 years of experience leading teams or major operational functions
  • Proven experience managing incidents in a 24/7 production environment
  • Strong background in troubleshooting, root cause analysis, and operational improvement
  • Experience with change management practices

Platform & Tooling Experience

  • Monitoring and observability platforms (e.g., Datadog, Prometheus, Grafana, New Relic)
  • Incident management and alerting tools (e.g., PagerDuty, Opsgenie)
  • Infrastructure and platform technologies (Linux systems, VMware, Ceph, cloud platforms)
  • Logging and telemetry systems (centralized logging, metrics, tracing)
  • Ability to translate complex technical data into clear insights
  • Strong communication skills, especially in high-pressure situations

Nice to Have

  • Background in Computer Science, Engineering, or a related field
  • Experience in managed hosting, cloud infrastructure, or SaaS environments
  • Experience defining and tracking system reliability and performance targets
  • Familiarity with ITIL or similar operational frameworks
  • Exposure to VMware, Ceph, Linux, and Windows platforms
  • Relevant certifications (AWS, RHCE, etc.)

We Offer:

  • Traditional and Roth 401(k) with company matching
  • A collaborative team culture
  • Consistent/set work hours
  • Challenging, non-redundant daily duties
  • A voice in how things get done

Disclaimer:

This job description is only a summary of the typical functions of the position. It is not intended to be an exhaustive or comprehensive list of all job responsibilities, tasks, or duties. Additional duties and tasks may be assigned as part of the job function. Liquid Web Inc. reserves the right to modify, interpret, or apply this job description in a way that best supports the organizational needs. The job description in no way creates or implies an employment contract. Employment remains “at will”.

Equal Employment Opportunity Policy: Liquid Web is committed to offering equal employment opportunity without regard to age, color, disability, gender, gender identity, genetic information, marital status, military status, national origin, race, religion, sexual orientation, veteran status, or any other legally protected characteristic.

Salary: $110,000 - $150,000



