Senior AI and HPC Observability Engineer

NVIDIA and Careers
Santa Clara, CA Full Time
POSTED ON 3/13/2026
AVAILABLE BEFORE 5/12/2026
NVIDIA is a pioneer in accelerated computing, known for inventing the GPU and driving breakthroughs in gaming, computer graphics, high-performance computing, and artificial intelligence. Our technology powers everything from generative AI to autonomous systems, and we continue to shape the future of computing through innovation and collaboration. Within this mission, our team, Managed AI Superclusters (MARS), builds and scales the infrastructure, platforms, and tools that enable researchers and engineers to develop the next generation of AI/ML systems. By joining us, you’ll help design solutions that power some of the world’s most advanced computing workloads.
Observability is at the heart of this transformation. We are looking for a strong AI & HPC Observability Engineer to build and scale next-generation Observability and Telemetry platforms. You will design and develop high-throughput, reliable telemetry pipelines and modern data infrastructure. This role requires solid distributed systems fundamentals, production-grade coding, and a passion for operational excellence.
What You Will Be Doing:
  • Design and scale observability platforms handling high-volume metrics, logs, and traces across distributed environments
  • Build high-performance backend services for telemetry ingestion, processing, and routing
  • Develop and extend OpenTelemetry collectors, processors, exporters, and instrumentation libraries
  • Build and optimize metrics pipelines using large-scale time-series storage systems
  • Design and operate real-time and batch telemetry pipelines using streaming and distributed data technologies
  • Improve platform reliability, performance, and cost efficiency through tuning, capacity planning, and system optimization
  • Develop monitoring, alerting, and service reliability frameworks to ensure platform health and performance
  • Collaborate with platform engineering, infrastructure, and site reliability teams to deliver production-grade observability solutions
What We Need to See:
  • Bachelor’s degree in Computer Science, Computer Engineering, or a related field, or equivalent experience
  • 5 years of experience building backend or distributed systems in production environments
  • Strong programming skills in Python, Go, or Java, with experience developing production-quality software
  • Hands-on experience with modern observability architectures, including metrics, logs, and traces
  • Solid experience with PromQL and time-series data systems
  • Experience building or operating distributed data pipelines using technologies such as Kafka, Spark, or Flink
  • Experience working with Kubernetes and cloud-native infrastructure
  • Strong understanding of distributed systems, concurrency, and fault-tolerant system design
  • Strong debugging, performance tuning, and production operations skills
Ways to Stand Out from the Crowd:
  • Proven experience designing and scaling observability platforms for AI, GPU, or HPC environments
  • Hands-on expertise with OpenTelemetry, Prometheus, Kafka, and high-volume distributed telemetry pipelines
  • Strong background in data engineering, time-series data modeling, and real-time performance tuning
  • Experience integrating observability with AI/ML pipelines, GPU workload monitoring, or intelligent alerting
  • Demonstrated use of statistical or machine learning techniques for anomaly detection, correlation, or predictive insights
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4.
You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until March 6, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Salary : $152,000 - $287,500
