What are the responsibilities and job description for the Technical Program Manager (TPM) position at GMI Cloud?
About Us
GMI Cloud is a fast-growing AI infrastructure company backed by Headline VC and one of only six cloud providers worldwide to earn NVIDIA’s prestigious Reference Platform Cloud Partner designation. We operate eight of our own GPU clusters across the U.S. and Asia, delivering a full spectrum of services, from GPU compute to AI model inference APIs. As an NVIDIA Reference Platform Cloud Partner, our infrastructure meets the highest standards for performance, security, and scalability in AI deployments. We empower AI startups and enterprises to “build AI without limits,” providing everything they need to prototype, train, and deploy AI models quickly and reliably.
About this role
We are looking for a Technical Program Manager (TPM) who combines strong program ownership, production sense, and execution rigor with a solid technical foundation in AI infrastructure and distributed systems.
In this role, you will own and drive complex, cross-functional programs that span GPU infrastructure, Kubernetes platforms, inference/training systems, and customer-facing AI services. You will ensure that high-impact initiatives move from design → implementation → production → scale, on time and with high quality.
This is a hands-on TPM role for someone who can:
- Speak fluently with engineers
- Think like a product manager about tradeoffs and user impact
- Execute like an owner in a fast-moving environment
Key Responsibilities
Program Ownership & Delivery Excellence
- Own end-to-end delivery of complex technical programs across AI infrastructure, GPU platforms, and cloud services.
- Define clear goals, milestones, dependencies, risks, and success metrics for multi-team initiatives.
- Drive execution rigor: timelines, accountability, escalation, and delivery predictability.
- Ensure programs reach production-ready quality, not just prototype or design completion.
Technical & Production Sense
- Partner closely with engineering to translate technical designs into executable delivery plans.
- Apply strong production judgment around reliability, scalability, performance, and operational readiness.
- Identify risks early (capacity, performance, security, operational complexity) and drive mitigation plans.
- Ensure launches include proper monitoring, alerting, rollback strategies, and operational ownership.
Cross-Functional Leadership
- Act as the connective tissue across Engineering, Product, Infrastructure, SRE, and Go-To-Market teams.
- Align stakeholders on priorities, tradeoffs, and sequencing in a resource-constrained environment.
- Communicate clearly and concisely to both technical and non-technical audiences, including leadership updates.
Platform & Infrastructure Programs
- Drive programs related to:
  - GPU cluster expansion and lifecycle management
  - Kubernetes platform evolution
  - AI inference and training infrastructure
  - Internal developer platforms and tooling
- Coordinate roadmap execution across multiple regions and environments.
Required Skills
Program Management: Proven ability to drive cross-functional technical programs from design to production with strong ownership and execution discipline.
Production Sense: Strong judgment around API reliability, latency, scalability, rollout quality, and operational readiness for production AI services.
AI / LLM Systems: Solid understanding of LLM and multimodal model inference workflows, including text, image, audio, or video APIs.
Inference & Serving: Familiarity with model serving concepts such as throughput, tail latency, batching, streaming, and cost-performance tradeoffs.
AI Infrastructure: General understanding of GPU-based inference systems and their impact on performance and scalability.
Communication: Clear, structured communication with engineering, product, and leadership stakeholders.
Preferred Qualifications
- Experience managing technical programs related to AI products, model APIs, or inference platforms.
- Hands-on exposure to LLM or multimodal inference systems in production environments.
- Familiarity with AI model APIs, SDKs, or developer-facing AI platforms.
- Strong technical foundation with the ability to reason about system tradeoffs and customer impact.
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.