What are the responsibilities and job description for the Senior, Staff Backend Engineer - Distributed System position at SproutsAI?
About Us
At Zettabyte, we’re on a mission to make AI compute ubiquitous, seamless, and limitless. We’re building a cloud where AI just works—anywhere, anytime. “AI Power. Everywhere.” Be part of the team designing the infrastructure for the AI-first world.
Why this role exists
We need a Backend Engineer to build the systems that orchestrate GPU clusters for AI workloads. You'll create APIs that handle GPU allocation, memory management, compute scheduling, and multi-tenant isolation, challenges unique to AI infrastructure that go far beyond typical backend engineering. As part of our backend team, you'll solve problems like:
- How do we efficiently share expensive GPU resources across users?
- How do we handle GPU memory constraints for large AI models?
- How do we ensure quality of service when workloads compete for compute?
What you'll do
- Design APIs that abstract complex GPU operations into simple developer experiences
- Build scheduling algorithms that maximize GPU utilization while ensuring SLA compliance
- Develop resource management systems for GPU lifecycle—provisioning, allocation, scheduling, and release
- Create usage tracking and billing systems for GPU-hours, memory usage, and compute utilization
- Implement monitoring for GPU-specific metrics, health checks, and automatic failure recovery
- Build multi-tenancy systems with resource isolation, quota management, and fair scheduling
- Optimize cold starts for model serving and implement efficient model loading strategies
- Collaborate with frontend engineers to expose complex infrastructure through intuitive interfaces
- Leverage AI-assisted coding tools (GitHub Copilot, Claude Code, Cursor IDE, etc.) to boost productivity and code quality
What we're looking for
- 5 years of backend engineering experience with distributed systems
- Strong proficiency in Go, Python, or similar backend languages
- Experience with resource scheduling, orchestration, and API design (REST, GraphQL, gRPC)
- Understanding of hardware constraints and system optimization
- Linux systems knowledge and containerization experience (Docker, Kubernetes)
- Comfortable working with expensive resources where efficiency directly impacts costs
- Excited about solving novel problems in AI infrastructure (not just another CRUD app)
- Startup mindset—comfortable with ambiguity and rapid iteration
Nice to have
- GPU or HPC cluster management experience
- Understanding of ML/AI workload patterns and requirements
- Experience with high-value resource allocation systems
- Background in performance optimization for compute-intensive workloads
- Familiarity with GPU virtualization and sharing technologies
- Experience building billing or metering systems
We provide a competitive salary and equity based on your experience and skill set.
This is a hybrid role: 3 days in office, 2 days WFH. You must be located in Palo Alto.
Applicants must be authorized to work in the United States without need for visa sponsorship.