
Senior DevOps Lead Engineer (AI Acceleration) - Hybrid

Calance
Santa Clara, CA Remote Contractor
POSTED ON 6/21/2026
AVAILABLE BEFORE 7/21/2026
The Role

You will be the senior DevOps technical lead on the Infrastructure team, owning the CI/CD pipelines, container infrastructure, observability stack, and shared tooling that AI/ML hardware accelerator development runs on in the lab, in the cloud, and across colocations at scale.

Because we design and manufacture AI acceleration silicon, a core part of this role is working with internal cloud and physical lab systems: automating and operating on-premises GPU clusters, high-speed interconnects, and lab server infrastructure, not just cloud resources. You will build the automation layer that ties lab hardware, cloud environments, and developer tooling into a single, reliable system.

You will also be instrumental in scaling that system globally, as we build toward a follow-the-sun DevOps model across our expanding engineering sites.

What You Will Do
DevOps Leadership
Own CI/CD pipelines, runners, and execution environments across software, silicon, hardware, and ML teams: GitLab CI, GitHub Actions, and build systems like Bazel.
Build and maintain automated provisioning and deployment pipelines for GPU driver stacks, AI/ML frameworks (PyTorch, TensorFlow), and inference software; implement container-based test harnesses (Docker/Kubernetes/Singularity) that verify driver and framework compatibility across hardware generations (NVIDIA, AMD, Intel).
Improve pipeline performance through parallelization, caching, and architectural changes; maintain the Docker image library supporting AI/ML workload testing across distributions and framework versions.

Automation & Infrastructure as Code
Own IaC and configuration management (Terraform, Ansible, Python, Go, Bash) across lab, on-prem, colo, and cloud (AWS, Azure, Google Cloud Platform), covering everything from GPU/CPU driver provisioning through infrastructure deployments, with remote state management, environment isolation, and plan validation.
Build automation to eliminate toil and enforce consistency across team workflows; implement auto-remediation where appropriate with blast-radius controls and approval gates for production systems.
Operate and automate Kubernetes clusters and HPC container environments (Singularity/Apptainer) across cloud and on-premises: installation, upgrades, workload management, and troubleshooting.

Observability, Reliability & Incident Response
Design and maintain dashboards, alerting, and monitoring (Prometheus/Grafana, DataDog) across CI runners, lab hardware, GPU utilization, and shared services; define SLOs/SLIs and lead structured incident response when they are breached.
Lead incident triage from bare metal to application layer, resolving infrastructure, software, and hardware faults across CI/CD, lab, container, and cloud environments, including GPU drivers, framework crashes, and network issues.

Documentation & Global Collaboration
Create and maintain high-quality documentation: architecture diagrams, troubleshooting guides, onboarding materials, and API/tool references.
Partner with Global DevOps and SRE team members to build a consistent, scalable operating model.
Serve as a technical resource across engineering teams: developing and sharing best practices, raising technical debt and reliability risks early, and always coming with a proposed plan.
Drive innovation by supporting R&D activities and leading proof-of-concept (POC) and proof-of-value (POV) evaluations for new tooling, infrastructure patterns, and accelerator technologies.

What You Will Bring
Required
Bachelor's or Master's in Computer Science, Electrical Engineering, or a related field, with 10 years of hands-on DevOps/infrastructure experience preferred (8 years minimum).
Deep Linux systems expertise: package management, networking (TCP/IP stack, routing, bonding), storage, systemd, kernel parameters, and performance tuning.
Production-grade Git-based CI/CD experience: pipeline design, runner management, merge request workflows, caching, and artifact handling.
Strong Python and/or Bash scripting for automation, with the ability to write clean, tested, maintainable code, not just one-off scripts.
Hands-on Ansible experience: writing playbooks from scratch for complex, multi-host configuration scenarios, and mentoring team members on Ansible and IaC best practices.
Docker/container expertise: multi-stage builds, registry management, security scanning, and container networking.
Kubernetes operational experience: cluster lifecycle, workload debugging, storage, networking, and RBAC.
Prometheus/Grafana observability stack: metric instrumentation, alert design, and dashboard development.
Experience supporting AI/ML or HPC workloads on GPU or accelerator hardware, including driver installation, framework compatibility, and hardware-level troubleshooting.
Comfort operating in fast-moving startups: you ship, document, and iterate rather than wait for perfect requirements.
Cross-site or follow-the-sun DevOps technical leadership experience.

Strongly Preferred
Production Go and/or Python for DevOps services beyond scripting: pipeline validators, health-check microservices, or auto-remediation agents.
Experience with artifact repositories such as Harbor, Nexus, Artifactory, or GitLab Package Registry.
Job scheduling systems: Slurm, LSF, or similar HPC-style cluster job control.
Knowledge of CPU/GPU architectures and high-speed interconnect fabrics: InfiniBand, RoCE (RDMA over Converged Ethernet), or NVLink.
Prior experience speaking at technical conferences or writing public-facing technical documentation or blog posts.

Salary: $80 - $110


