What are the responsibilities and job description for the Director of Engineering position at XPerf Inc.?
About XPerf
XPerf AI is an early-stage AI infrastructure startup building software that optimizes GPU and ASIC utilization for large-scale AI compute clusters. We target private cloud, on-premise, and air-gapped datacenter environments — the underserved majority of enterprise AI deployments that hyperscaler-specific tooling leaves behind.
Our platform takes a vendor-agnostic approach across NVIDIA, AMD Instinct, and other accelerators, with a focus on execution efficiency: reducing wasted cycles, detecting training stragglers, and giving operators ground-truth visibility into what their hardware is actually doing.
We are a small, technical team moving fast. This is a foundational engineering leadership role, critical to the company's trajectory.
The Role: Director of Engineering
As Director of Engineering, you will own the technical execution of XPerf's core platform — from GPU observability and cluster instrumentation through Kubernetes-native orchestration and agentic AI application layers. You will work directly with the CEO and founding team to translate roadmap priorities into shipped software, make key architectural decisions, and build and lead a small, high-caliber engineering team.
You will drive the full product development lifecycle: defining application architecture and technical specifications, leading design reviews, breaking roadmap goals into development milestones, and actively participating in implementation and iteration. You will be accountable for quality, velocity, and on-time delivery across all engineering workstreams — setting the bar and holding the team to it.
This is a deeply hands-on role. You will be writing code, reviewing PRs, debugging complex distributed systems, and making real-time architecture calls — not delegating your way through problems.
What You'll Own
Platform Architecture & Engineering
- Design and evolve XPerf's core observability stack — DCGM, Prometheus, Grafana Alloy, custom exporters for GPU/ASIC utilization metrics
- Lead development of the performance optimization engine — straggler detection, execution efficiency delta reporting, workload-aware cluster analysis (see the sketch after this list)
- Architect multi-tenant, multi-accelerator instrumentation pipelines targeting NVIDIA and AMD GPU hardware
- Own the agentic AI application layer — RAG-based datacenter monitoring, automated troubleshooting, and intelligent alerting
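To make the straggler-detection work above concrete, here is a minimal sketch in Python. It assumes per-rank step-time samples are already being collected elsewhere; the function name, rank-label format, and 1.25x tolerance are illustrative placeholders, not XPerf's actual engine.

```python
from statistics import median

def find_stragglers(step_times_s: dict[str, float], tolerance: float = 1.25) -> list[str]:
    """Flag ranks whose latest training step ran noticeably slower than the cluster median.

    step_times_s maps a rank identifier (e.g. "node03/gpu1") to its most recent
    step duration in seconds; tolerance is the slowdown factor that triggers a flag.
    """
    if not step_times_s:
        return []
    baseline = median(step_times_s.values())
    return [rank for rank, t in step_times_s.items() if t > baseline * tolerance]

# Example: one rank is roughly 40% slower than its peers and gets flagged.
samples = {"node01/gpu0": 1.02, "node02/gpu0": 0.98, "node03/gpu1": 1.41}
print(find_stragglers(samples))  # ['node03/gpu1']
```

A production version would derive the baseline and tolerance per workload and feed flagged ranks into efficiency-delta reporting rather than printing them.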
Infrastructure, Kubernetes & Observability
- Implement GitOps pipelines, Helm chart libraries, and CI/CD automation across microservice deployments
- Own the full monitoring stack — Prometheus, Grafana Mimir/Thanos for multi-tenant scale, Loki for log aggregation, Alloy for collection (see the exporter sketch after this list)
- Build and maintain the xperf-gpu-metrics platform with tiered backends — Direct/CLI, Prometheus, and XPerf Platform modes
- Develop NCCL/RCCL profiling hooks for deep training workload observability
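The exporter sketch referenced above shows the flavor of this collection work: a single-node Python exporter that publishes per-GPU utilization for Prometheus (or Alloy) to scrape. It assumes NVIDIA's NVML bindings (pynvml) and the prometheus_client library; the metric name and port are arbitrary, and this is not the xperf-gpu-metrics codebase itself.

```python
import time

import pynvml
from prometheus_client import Gauge, start_http_server

# Per-GPU compute utilization, labelled by device index.
GPU_UTIL = Gauge("gpu_utilization_percent", "GPU compute utilization (%)", ["gpu"])

def collect() -> None:
    """Poll NVML once and update the gauge for every visible device."""
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        GPU_UTIL.labels(gpu=str(i)).set(util.gpu)

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9400)  # port is arbitrary; Prometheus scrapes /metrics here
    while True:
        collect()
        time.sleep(15)
```

In practice DCGM's own exporter already covers standard NVIDIA utilization metrics; custom exporters like this matter for signals DCGM does not expose and for non-NVIDIA accelerators.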
Team & Process
- Establish engineering best practices, code review standards, and development processes as the team scales
- Collaborate with go-to-market on technical narratives, design partner integrations, and customer-facing engineering deliverables
- Evaluate and onboard new hardware design partners across the GPU accelerator ecosystem
What We're Looking For
Required
- 3–8 years of experience in software engineering, platform/infrastructure engineering, or AI application development, with at least three commercial enterprise software products delivered
- Strong software engineering fundamentals — proficiency in Python and Go for production services, automation, operators, and tooling
- AI application or agent development experience — RAG pipelines, LLM application layers, agentic frameworks, vector stores, or ML data pipelines — or a strong desire to own this layer as a core part of the role (see the retrieval sketch after this list)
- Production Kubernetes experience — bare metal or GPU clusters, custom operators/controllers, CRDs, RBAC, Helm, GitOps
- Hands-on GPU cluster operations — driver management, NCCL/RCCL testing, DCGM, GPU device plugins, distributed training debugging
- Strong networking fundamentals — TCP/IP, InfiniBand/RoCE, SR-IOV, high-speed NICs, CNI internals
- Familiarity with observability stacks — Prometheus, Grafana, and at least one long-term storage backend (Thanos, Mimir, Cortex)
- Demonstrated ability to lead engineering delivery — driving design reviews, setting milestones, managing iteration cycles, and shipping on time with quality
- Comfortable in early-stage environments — high ambiguity, fast iteration, and wearing multiple hats simultaneously
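For the AI application requirement above, the retrieval step is the core of a RAG-based monitoring flow. The sketch below is deliberately generic: embed stands in for whatever text-embedding model is chosen, the brute-force cosine scan would be replaced by a vector store in practice, and none of the names come from XPerf's stack.

```python
from math import sqrt
from typing import Callable, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], embed: Callable[[str], list[float]], k: int = 3) -> list[str]:
    """Return the k runbook/alert snippets most similar to the query.

    embed is any text-embedding function supplied by the caller; a real pipeline
    would pre-compute and index document vectors instead of embedding them on
    every call.
    """
    q_vec = embed(query)
    return sorted(docs, key=lambda d: cosine(embed(d), q_vec), reverse=True)[:k]

# The retrieved snippets are then prepended to the LLM prompt, for example:
#   context = "\n".join(retrieve(alert_text, runbooks, embed))
#   prompt = f"Context:\n{context}\n\nDiagnose the alert above and propose next steps."
```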
Strong Pluses
- Experience with non-NVIDIA accelerators — AMD Instinct GPU software stack, ROCm, RCCL, and AMD datacenter tooling
- AI/ML model development or training infrastructure experience — familiarity with distributed training frameworks such as DeepSpeed, Megatron, or PyTorch FSDP
- Experience building or operating AI data pipelines — embedding, vector indexing, retrieval, or model serving infrastructure
- LLM application architecture — multi-agent systems, tool-use frameworks, prompt engineering at scale, or fine-tuning pipelines
- HPC or AI workload scheduling experience — familiarity with job schedulers such as Volcano, SLURM, or LSF, gang scheduling, priority queuing, and resource quota management for large-scale GPU clusters
- SC (Supercomputing) conference participation or HPC community involvement
What You Won't Find Here
- A bureaucratic engineering org with layers of approvals — you will have real autonomy
- A purely managerial role — this position requires deep technical execution
- NVIDIA-only thinking — we work across NVIDIA and non-NVIDIA accelerators and you should be comfortable with both
- Hyperscaler-dependent assumptions — our customers run private clouds, on-prem clusters, and air-gapped environments
Compensation & Structure
- Location: Round Rock, TX — on-site preferred, hybrid considered
- Stage: Pre-Seed (post-announcement)
- Compensation: Competitive base plus meaningful early equity
- Reports To: CEO / Co-Founder
- Team Size: Small founding team — first engineering hire at this level
How to Apply
- Send a connection request to Alex Carter, CEO and Co-Founder, indicating your interest in applying.
- Once the connection is accepted, follow up with your resume and a cover letter. The cover letter should explain why you are interested in the role and describe the most complex infrastructure or GPU cluster problem you've solved. We care far more about your technical depth and judgment than the prestige of your past employers.