What are the responsibilities and job description for the Director of Engineering position at XPerf Inc.?
About XPerf
XPerf AI is an early-stage AI infrastructure startup building software that optimizes GPU and ASIC utilization for large-scale AI compute clusters. We target private cloud, on-premise, and air-gapped datacenter environments — the underserved majority of enterprise AI deployments that hyperscaler-specific tooling leaves behind.
Our platform takes a vendor-agnostic approach across NVIDIA, AMD Instinct, and other accelerators, with a focus on execution efficiency: reducing wasted cycles, detecting training stragglers, and giving operators ground-truth visibility into what their hardware is actually doing.
We are a small, technical team moving fast. This is a foundational engineering leadership role, critical to the company's trajectory.
The Role: Director of Engineering
As Director of Engineering, you will own the technical execution of XPerf's core platform — from GPU observability and cluster instrumentation through Kubernetes-native orchestration and agentic AI application layers. You will work directly with the CEO and founding team to translate roadmap priorities into shipped software, make key architectural decisions, and build and lead a small, high-caliber engineering team.
You will drive the full product development lifecycle: defining application architecture and technical specifications, leading design reviews, breaking roadmap goals into development milestones, and actively participating in implementation and iteration. You will be accountable for quality, velocity, and on-time delivery across all engineering workstreams — setting the bar and holding the team to it.
This is a deeply hands-on role. You will be writing code, reviewing PRs, debugging complex distributed systems, and making real-time architecture calls — not delegating your way through problems.
What You'll Own
Platform Architecture & Engineering
- Design and evolve XPerf's core observability stack — DCGM, Prometheus, Grafana Alloy, custom exporters for GPU/ASIC utilization metrics
- Lead development of the performance optimization engine — straggler detection, execution efficiency delta reporting, workload-aware cluster analysis (see the sketch after this list)
- Architect multi-tenant, multi-accelerator instrumentation pipelines targeting NVIDIA and AMD GPU hardware
- Own the agentic AI application layer — RAG-based datacenter monitoring, automated troubleshooting, and intelligent alerting
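To make the straggler-detection work above concrete, here is a minimal sketch in Python. It assumes per-rank step-time samples are already being collected elsewhere; the function name, rank-label format, and 1.25x tolerance are illustrative placeholders, not XPerf's actual engine.

```python
from statistics import median

def find_stragglers(step_times_s: dict[str, float], tolerance: float = 1.25) -> list[str]:
    """Flag ranks whose latest training step ran noticeably slower than the cluster median.

    step_times_s maps a rank identifier (e.g. "node03/gpu1") to its most recent
    step duration in seconds; tolerance is the slowdown factor that triggers a flag.
    """
    if not step_times_s:
        return []
    baseline = median(step_times_s.values())
    return [rank for rank, t in step_times_s.items() if t > baseline * tolerance]

# Example: one rank is roughly 40% slower than its peers and gets flagged.
samples = {"node01/gpu0": 1.02, "node02/gpu0": 0.98, "node03/gpu1": 1.41}
print(find_stragglers(samples))  # ['node03/gpu1']
```

A production version would derive the baseline and tolerance per workload and feed flagged ranks into efficiency-delta reporting rather than printing them.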
Infrastructure, Kubernetes & Observability
- Implement GitOps pipelines, Helm chart libraries, and CI/CD automation across microservice deployments
- Own the full monitoring stack — Prometheus, Grafana Mimir/Thanos for multi-tenant scale, Loki for log aggregation, Alloy for collection (see the exporter sketch after this list)
- Build and maintain the xperf-gpu-metrics platform with tiered backends — Direct/CLI, Prometheus, and XPerf Platform modes
- Develop NCCL/RCCL profiling hooks for deep training workload observability
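The exporter sketch referenced above shows the flavor of this collection work: a single-node Python exporter that publishes per-GPU utilization for Prometheus (or Alloy) to scrape. It assumes NVIDIA's NVML bindings (pynvml) and the prometheus_client library; the metric name and port are arbitrary, and this is not the xperf-gpu-metrics codebase itself.

```python
import time

import pynvml
from prometheus_client import Gauge, start_http_server

# Per-GPU compute utilization, labelled by device index.
GPU_UTIL = Gauge("gpu_utilization_percent", "GPU compute utilization (%)", ["gpu"])

def collect() -> None:
    """Poll NVML once and update the gauge for every visible device."""
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        GPU_UTIL.labels(gpu=str(i)).set(util.gpu)

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9400)  # port is arbitrary; Prometheus scrapes /metrics here
    while True:
        collect()
        time.sleep(15)
```

In practice DCGM's own exporter already covers standard NVIDIA utilization metrics; custom exporters like this matter for signals DCGM does not expose and for non-NVIDIA accelerators.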
Team & Process
- Establish engineering best practices, code review standards, and development processes as the team scales
- Collaborate with go-to-market on technical narratives, design partner integrations, and customer-facing engineering deliverables
- Evaluate and onboard new hardware design partners across the GPU accelerator ecosystem
What We're Looking For
Required
- 3–8 years of experience in software engineering, platform/infrastructure engineering, or AI application development, with at least three commercial enterprise software products delivered
- Strong software engineering fundamentals — proficiency in Python and Go for production services, automation, operators, and tooling
- AI application or agent development experience — RAG pipelines, LLM application layers, agentic frameworks, vector stores, or ML data pipelines — or a strong desire to own this layer as a core part of the role (see the retrieval sketch after this list)
- Production Kubernetes experience — bare metal or GPU clusters, custom operators/controllers, CRDs, RBAC, Helm, GitOps
- Hands-on GPU cluster operations — driver management, NCCL/RCCL testing, DCGM, GPU device plugins, distributed training debugging
- Strong networking fundamentals — TCP/IP, InfiniBand/RoCE, SR-IOV, high-speed NICs, CNI internals
- Familiarity with observability stacks — Prometheus, Grafana, and at least one long-term storage backend (Thanos, Mimir, Cortex)
- Demonstrated ability to lead engineering delivery — driving design reviews, setting milestones, managing iteration cycles, and shipping on time with quality
- Comfortable in early-stage environments — high ambiguity, fast iteration, and wearing multiple hats simultaneously
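For the AI application requirement above, the retrieval step is the core of a RAG-based monitoring flow. The sketch below is deliberately generic: embed stands in for whatever text-embedding model is chosen, the brute-force cosine scan would be replaced by a vector store in practice, and none of the names come from XPerf's stack.

```python
from math import sqrt
from typing import Callable, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], embed: Callable[[str], list[float]], k: int = 3) -> list[str]:
    """Return the k runbook/alert snippets most similar to the query.

    embed is any text-embedding function supplied by the caller; a real pipeline
    would pre-compute and index document vectors instead of embedding them on
    every call.
    """
    q_vec = embed(query)
    return sorted(docs, key=lambda d: cosine(embed(d), q_vec), reverse=True)[:k]

# The retrieved snippets are then prepended to the LLM prompt, for example:
#   context = "\n".join(retrieve(alert_text, runbooks, embed))
#   prompt = f"Context:\n{context}\n\nDiagnose the alert above and propose next steps."
```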
Strong Pluses
- Experience with non-NVIDIA accelerators — AMD Instinct GPU software stack, ROCm, RCCL, and AMD datacenter tooling
- AI/ML model development or training infrastructure experience — familiarity with distributed training frameworks such as DeepSpeed, Megatron, or PyTorch FSDP
- Experience building or operating AI data pipelines — embedding, vector indexing, retrieval, or model serving infrastructure
- LLM application architecture — multi-agent systems, tool-use frameworks, prompt engineering at scale, or fine-tuning pipelines
- HPC or AI workload scheduling experience — familiarity with job schedulers such as Volcano, SLURM, or LSF, gang scheduling, priority queuing, and resource quota management for large-scale GPU clusters
- SC (Supercomputing) conference participation or HPC community involvement
What You Won't Find Here
- A bureaucratic engineering org with layers of approvals — you will have real autonomy
- A purely managerial role — this position requires deep technical execution
- NVIDIA-only thinking — we work across NVIDIA and non-NVIDIA accelerators and you should be comfortable with both
- Hyperscaler-dependent assumptions — our customers run private clouds, on-prem clusters, and air-gapped environments
Compensation & Structure
- Location: Round Rock, TX — on-site preferred, hybrid considered
- Stage: Pre-Seed (post-announcement)
- Compensation: Competitive base plus meaningful early equity
- Reports To: CEO / Co-Founder
- Team Size: Small founding team — first engineering hire at this level
How to Apply
- Send a connection request to Alex Carter, CEO and Co-Founder, indicating your interest in applying.
- Once the connection is accepted, follow up with your resume and a cover letter. The cover letter should explain why you are interested in the role and describe the most complex infrastructure or GPU cluster problem you've solved. We care far more about your technical depth and judgment than the prestige of your past employers.