Member of Technical Staff - Training Platform

Prime Intellect
San Francisco, CA | Full Time
POSTED ON 5/11/2026
AVAILABLE BEFORE 6/9/2026
Building Open Superintelligence Infrastructure

Prime Intellect is building the open superintelligence stack - from frontier agentic models to the infrastructure that lets anyone create, train, and deploy them. We aggregate and orchestrate global compute into a single control plane and pair it with the full RL post-training stack: environments, secure sandboxes, verifiable evals, and our async RL trainer. We enable researchers, startups, and enterprises to run end-to-end reinforcement learning at frontier scale, adapting models to real tools, workflows, and deployment contexts.

We recently raised $15M in funding (taking total funding to $20M), led by Founders Fund with participation from Menlo Ventures and prominent angels including Andrej Karpathy (Eureka Labs, Tesla, OpenAI), Tri Dao (Chief Scientist, Together AI), Dylan Patel (SemiAnalysis), Clem Delangue (Hugging Face), Emad Mostaque (Stability AI), and many others.

Role Impact

You'll help build our hosted training platform - the product that lets users launch LoRA and full fine-tuning runs on managed GPU clusters with a single API call or a few clicks. The role spans the developer-facing platform and the underlying Kubernetes-based training infrastructure that runs the jobs.

Core Technical Responsibilities

Hosted Training Infrastructure

  • Design and operate Kubernetes-based training and inference orchestration across multi-cluster, multi-cloud GPU fleets
  • Build and maintain Helm charts that compose trainers, inference servers, environment servers, and supporting services into reproducible "training stacks"
  • Develop the Python control-plane agents that watch pods, report run state to the platform, and keep clusters in sync
  • Implement scheduling and autoscaling for heterogeneous hardware (H100/H200/B200) using KEDA, LeaderWorkerSet, taints/tolerations, and gang scheduling
  • Run a tight GitOps workflow - every change ships through PRs, Helm values, and CI
  • Build node-local model caches, checkpoint pipelines, and shared storage for fast cold starts
  • Operate the observability stack (Prometheus, Grafana, Loki, DCGM) and make GPU cluster debugging fast

Platform Development

  • Build the developer-facing surfaces for hosted training: job submission, live run monitoring, logs, metrics, model/adapter management, comparisons
  • Develop FastAPI backend services and REST APIs that bridge the platform to running clusters
  • Build real-time monitoring and debugging tools (streaming logs, step-level metrics, failure analysis)
  • Ship product UI in Next.js / React / TypeScript with shadcn, Tailwind, tRPC, and TanStack Query

Research Bridge

  • Interface with the RL trainer, inference servers, and environment servers running inside our clusters
  • Productize new training capabilities (new model architectures, RL algorithms, and training modes)

Technical Requirements

We're looking for engineers who are fluent across three areas - you don't need to be the world's best at any one, but you should have real depth in all three and a clear point of view on how they connect.

AI & GPU Landscape

  • Strong working knowledge of the modern AI stack - open model families, fine-tuning techniques (LoRA, QLoRA, full FT, RLHF/RLAIF), and inference engines (vLLM, SGLang, TensorRT-LLM)
  • Familiarity with GPU hardware tradeoffs (H100 / H200 / B200, NVLink, interconnects, memory hierarchy) and what they mean for training and inference workloads
  • Understanding of distributed training fundamentals (data/tensor/pipeline/expert parallelism, NCCL, multi-node scheduling)
  • Awareness of what's happening at the frontier - new models, training methods, infra patterns - and the ability to translate that into product decisions

Kubernetes & Infrastructure

  • Strong Kubernetes operations experience - Helm, CRDs, operators, KEDA, gang scheduling, GPU operator
  • Comfortable debugging real production clusters (kubectl, pod lifecycle, node issues, networking)
  • Cloud platform experience (GCP preferred - GCS, GKE, Cloud Run, Cloud Tasks)
  • Infrastructure automation (Helm, Terraform, Ansible) and a GitOps mindset
  • Observability: Prometheus, Grafana, Loki, OpenTelemetry, DCGM
  • Linux fundamentals: networking, namespaces, performance tuning

Programming & Platform

  • Strong Python backend development (FastAPI, async, SQLAlchemy)
  • Comfortable building Python control-plane agents that talk to Kubernetes APIs
  • Modern frontend development (TypeScript, React/Next.js, Tailwind, shadcn) - enough to ship product surfaces end-to-end
  • REST and tRPC API design
  • Experience building developer tools, dashboards, and live-monitoring UIs

What we offer

  • Cash compensation $150K–$300K with significant equity
  • Flexible work arrangement (remote or San Francisco office)
  • Full visa sponsorship and relocation support
  • Professional development budget for courses and conferences
  • Regular team off-sites and conference attendance
  • Opportunity to shape the future of decentralized AI development

Growth Opportunity

You'll join a team of experienced engineers and researchers working on cutting-edge problems in AI infrastructure. We believe in open development and encourage team members to contribute to the broader AI community through research and open-source work.

We value potential over perfection - if you're passionate about democratizing AI development and have experience in either platform or infrastructure development (ideally both), we want to talk to you.

Ready to help shape the future of AI? Apply now and join us in our mission to make powerful AI models accessible to everyone.

Salary: $150,000 - $300,000



