What are the responsibilities and job description for the Software Engineer - AI Infrastructure position at Harell Data?
About the Role - Onsite in Palo Alto, CA or Bellevue, WA (not eligible for relocation assistance)
As a founding AI Infrastructure Engineer, you will report directly to the CTO and lead the development of our core compute and orchestration layer. This is a high-impact role where you will hold a significant ownership stake in the company and lead the 0-to-1 build of our infrastructure. You will work closely with our customers to translate their needs into a world-class platform, while simultaneously shaping our engineering culture and technical direction from the ground up.
What You Will Do
- Architect GPU Compute Fabric: Build and manage the orchestration layer for GPU workloads, ensuring efficient resource allocation and cost management for large-scale training, fine-tuning, and inference.
- Design Developer Interfaces: Build developer-centric SDKs and APIs that transform complex ML workflows into intuitive experiences for researchers and data scientists.
- Operationalize the ML Lifecycle: Develop robust, end-to-end pipelines, from data ingestion and preprocessing to secure model serving and monitoring.
- Client Success & Observability: Work closely with customers to debug fine-tuning jobs and build the observability tools required to track model performance and resource health in real time.
- Define Systems & Culture Strategy: Lead the technical roadmap by making critical "build vs. buy" decisions on infrastructure and security, while directly shaping the team’s engineering standards and hiring processes.
Qualifications
- 5 years of software engineering experience, with a focus on ML infrastructure or backend systems supporting ML workloads.
- Experience deploying and operating ML/DL training or inference pipelines in production (PyTorch, Hugging Face, or similar).
- Hands-on experience with Kubernetes on AWS/GCP, ideally for GPU workloads.
- Strong CS fundamentals and system design skills.
- Ability to thrive in fast-paced, dynamic environments and navigate ambiguity.
This is an onsite role in Palo Alto, CA or Bellevue, WA only and is not eligible for relocation assistance.