Senior Site Reliability Engineer (SRE)

Rhino Federated Computing
Boston, MA Full Time
POSTED ON 10/26/2025 CLOSED ON 12/16/2025

What are the responsibilities and job description for the Senior Site Reliability Engineer (SRE) position at Rhino Federated Computing?

About the role

The Senior Site Reliability Engineer will be responsible for engineering the reliability and resilience of Rhino’s Federated Computing Platform (Rhino FCP). This distributed infrastructure supports cutting-edge AI/ML research and development across highly regulated industries, including healthcare, finance, and life sciences, by enabling secure, privacy-preserving data collaboration around the world.

You will apply software engineering discipline to operations, focusing on the production environment across a fleet of installations deployed behind the firewalls of partner organizations and our centralized cloud orchestration layer. You'll help define and monitor the Service Level Objectives (SLOs) for the platform’s core services. You'll collaborate closely with backend engineers and DevOps engineers to integrate reliability directly into the FCP’s architecture.

This role involves proactive ownership of production risk: from defining reliability metrics and reducing operational toil to designing failure-resilient systems and leading blameless incident response. It’s ideal for someone who thrives on solving complex distributed systems problems, views automation as a primary engineering function, and is excited to drive guaranteed stability in secure, distributed AI platforms.


Key Responsibilities
  • Help Define and Monitor Reliability Standards: Define, monitor, and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for core platform services.
  • Toil Reduction and Automation: Systematically identify, prioritize, and automate repetitive operational work to eliminate manual tasks and improve system predictability and consistency.
  • Technical Infrastructure and Installation Support: Support internal stakeholders and work with customers on technical issues related to platform installation and platform infrastructure, leveraging learnings from support issues to drive further improvements in processes and automation.
  • Capacity and Performance Optimization: Conduct load testing and capacity planning to predict and address scaling bottlenecks, specifically optimizing the performance and resource utilization of the federated AI/ML computing clusters.
  • Production Readiness Engineering: Partner with development teams from the design phase (shifting left) to ensure new infrastructure and core components are inherently scalable, observable, and failure-resilient before they reach production.
  • Incident and Emergency Response: Be the escalation point for critical incidents, lead the swift restoration of service, and drive blameless retrospectives and corrective action follow-ups.


About the candidate

Candidates should have 5 years of professional experience with a mix of the experiences described below:

  • 5 years of experience in SRE roles utilizing cloud platforms (AWS, GCP, and/or Azure).
  • 5 years of experience with Linux.
  • 5 years of experience with Bash/Python.
  • 3 years of experience working with infrastructure-as-code (IaC) and configuration management (CM) tools (Terragrunt, Ansible).
  • 3 years of experience designing and developing infrastructure components with Kubernetes.
  • 5 years of experience implementing and maintaining observability solutions (Prometheus/VictoriaMetrics, Grafana, etc.) and reporting on SLIs and SLOs.
  • Deep understanding of networking, particularly in complex, high-security contexts: mTLS, gRPC, network policies within Kubernetes, overlay networks (e.g., WireGuard/VPNs) for distributed client deployments, and isolation in confidential computing environments.
  • Experience working in a startup environment.
  • Advantage for experience with AI/ML-based products or platforms.
  • Advantage for experience with distributed systems.
  • Advantage for experience with products with a focus on data security and privacy (e.g., PII data protection).
  • The role is open to candidates who are based in Boston, MA (hybrid work environment).



Salary.com Estimation for Senior Site Reliability Engineer (SRE) in Boston, MA
$118,127 to $138,518

This job has expired.
