Member of Technical Staff, Site Reliability Engineer

Bessemer Venture Partners India
San Francisco, CA Full Time
POSTED ON 6/4/2026
AVAILABLE BEFORE 7/3/2026
Voice AI that resolves, not transfers.

Most phone systems trap callers in menus and scripts. Vapi is the platform for deploying voice agents that know your business and can listen, adapt, and resolve in minutes.

  • The numbers: 1 billion calls. 1 million developers. 10x enterprise ARR growth
  • The customers: Amazon Ring, ServiceTitan, New York Life, Intuit, Kavak, and thousands more, from YC startups to the Fortune 500
  • The news: a $50M Series B led by Peak XV Partners, with Bessemer Venture Partners, Kleiner Perkins, M12 (Microsoft's Venture Fund), Y Combinator, and our earlier backers. Total raised: $72M

Why We’re Hiring This Role:

  • 99.99% call completion is the number this role drives. Vapi runs live phone calls — a p99 spike means callers drop. We’ve had 15 stability-gap outages worth learning from, and we need someone who runs incident command, owns SLOs and error budgets, and builds the reliability culture from scratch.
  • This is not a bash-and-YAML role. You’ll ship code (Go or TypeScript) for services that monitor and manage the platform: auto-remediation, capacity forecasters, oncall tooling. Capacity planning, load testing, and KEDA-based autoscaling for Vapi’s wscaler and workerpool-cron-scaler are on your plate.

What You’ll Do:

  • 30 days: Join the oncall rotation. Walk the 15 stability-gap incidents and turn the patterns into a prioritized reliability backlog. Define the first set of SLOs for the call-completion path.
  • 60 days: Stand up error budgets and SLO-based alerting in Chronosphere/Prometheus for the highest-impact services. Run the first proper load test against provider rate limits and per-org concurrency. Tune autoscaling for wscaler / workerpool-cron-scaler.
  • 90 days: Ship a real platform service (capacity forecaster, auto-remediation, or oncall tooling) in Go or TypeScript. Own the postmortem process. Drive a measurable improvement in p99 call completion or MTTR.

Who You Are:

Must-haves

  • You’ve run incident command and postmortem discipline at scale on a real oncall rotation.
  • You’ve operated SLOs and error budgets in Chronosphere, Prometheus, Grafana, or Datadog.
  • You’ve done capacity planning and load testing for production systems with real users.
  • You’re fluent in Kubernetes production ops: pod crash diagnosis, HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown.
  • You know backpressure and autoscaling patterns — KEDA, custom metrics scaling.

Nice-to-haves

  • You ship code, not just scripts. You can build platform services in Go or TypeScript (comparable to Vapi’s cluster-manager, database-health, wscaler, and incidentManager).
  • Real-time / latency-sensitive product background where degraded means a dropped call, not a slow dashboard.

Tech stack you’ll work in

  • Languages: Go and TypeScript (you ship code, not just scripts), Bash.
  • Observability: Chronosphere, Prometheus, Grafana, Datadog, OpenTelemetry.
  • Orchestration: Kubernetes on EKS — production ops (HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown, pod crash diagnosis).
  • Autoscaling and backpressure: KEDA, custom metrics scaling (matches Vapi’s wscaler and workerpool-cron-scaler).
  • Load testing: script-based load testing, provider rate-limit auditing, per-org concurrency auditing.
  • Vapi services you’ll touch or build: cluster-manager, database-health, wscaler, incidentManager.

Where you likely come from

  • A real-time / latency-sensitive product (Discord, Zoom, Mux, Twitch, Twilio, LiveKit, Cloudflare, a trading firm, a gaming backend), or a FAANG SRE / Production Engineer (Google, Uber, Twitter/X, Meta) who misses being hands-on.
  • Weak fit: SRE from analytics or CRM backends where “degraded” means a slow dashboard, not a dropped call. Anyone uncomfortable reading or writing code.

Why Vapi:

  • Generational impact: Build the human interface for every business
  • Ownership culture: 70% of the company are previous founders
  • Kind team: The founders, Jordan and Nikhil, are Canadians
  • Tier-1 Investors: YC, KP seed, Bessemer Series A

What We Offer:

  • Real stake: We offer a competitive salary and excellent equity ownership
  • Comprehensive health coverage: medical, dental, and vision plans
  • Team love: We love hanging out, and we do quarterly off-sites
  • Flexible time off: take what you need

More: catered meals, transportation, gym, and a $10k annual L&D budget

Compensation Range: $200K - $270K
