
Hyperbolic Labs - Senior Site Reliability Engineer

deCircle
San Francisco, CA (Full Time)
Posted on 4/26/2026
Available before 6/26/2026

Hyperbolic Labs is on a mission to democratize AI by breaking down the barriers to computing power with our Open-Access AI Cloud. By aggregating computing resources across the globe, we offer an innovative GPU marketplace and AI inference service that promise affordability and accessibility for all. As pioneers at the intersection of AI and open-source technology, we believe in an open future where AI innovation is limited only by imagination, not by access to resources. We're looking for forward-thinking individuals who share our passion for making AI universally accessible, secure, and affordable. Join us in building a platform that empowers innovators everywhere to turn their visionary AI projects into reality.

As we prepare for growth after our Series A, our team — led by co-founders with PhDs in AI, Math, and Computer Science — is poised to redefine computing.

About the Role

We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security. As an aggregator of compute resources from hundreds of global suppliers, our SLOs, trust, and economic efficiency are product-critical. You'll be responsible for defining and maintaining service level objectives for job success rates, building robust incident response systems, managing capacity across our distributed GPU network, and implementing secure rollout and rollback mechanisms that keep our platform running smoothly 24/7.

In this role, you'll establish the reliability standards that define customer trust in our platform and design monitoring and alerting systems that provide deep visibility into our infrastructure. You'll build automation for capacity management and resource allocation, lead incident response and post-mortem processes, and work closely with engineering teams to improve system resilience. You'll also focus on security and infrastructure hardening: ensuring strong isolation between tenants and suppliers, implementing key management systems, and building compliance frameworks. This is a high-impact position where your work directly influences our ability to deliver on our promise of affordable, accessible AI compute at scale.




Required Qualifications

  • Expert in site reliability engineering with proven experience defining, monitoring, and maintaining SLOs and SLAs for production systems

  • Strong background in capacity planning and management, including forecasting, resource allocation, and cost optimization for distributed systems

  • Experienced in incident response, on-call rotations, and post-mortem processes with a track record of reducing MTTR and improving system resilience

  • Deep knowledge of deployment systems including progressive rollouts, canary deployments, feature flags, and automated rollback mechanisms

  • Proficient in observability tools and practices including metrics, logging, tracing, and alerting systems (Prometheus, Grafana, ELK stack, or similar)

  • Strong understanding of infrastructure security including tenant isolation, workload isolation, network segmentation, and security hardening

  • Experience with secrets management, key management systems (KMS), certificate management, and secure credential rotation

  • Knowledge of compliance frameworks and security best practices for cloud platforms (SOC 2, ISO 27001, or similar)

  • Excellent problem-solving skills, with the ability to debug complex distributed-systems issues under pressure

  • Strong automation mindset with experience using infrastructure-as-code, configuration management, and CI/CD pipelines

Preferred Qualifications

  • Experience operating GPU infrastructure, AI/ML platforms, or compute marketplaces at scale

  • Background in distributed systems, peer-to-peer networks, or decentralized infrastructure

  • Knowledge of multi-tenancy security patterns, container security, and runtime security tools

  • Experience with chaos engineering, fault injection, and resilience testing

  • Familiarity with cost optimization strategies for cloud infrastructure and GPU resources

  • Experience building and operating systems with demanding uptime requirements (99.9% SLAs)

  • Background at companies like AWS, Google Cloud, Azure, or fast-growing infrastructure startups

  • Contributions to open-source reliability, observability, or security tools

Salary.com Estimation for Hyperbolic Labs - Senior Site Reliability Engineer in San Francisco, CA
$118,633 to $165,442




