What are the responsibilities and job description for the Lead Software Engineer (Observability & Telemetry) position at griddable.io?
Description
Join the team responsible for innovating and maintaining the massive-scale, distributed systems that monitor Salesforce’s infrastructure.
This position is located in the Bellevue office and requires onsite presence.
The Network Visibility and Telemetry team is responsible for designing, building, and operating a set of systems and services that deliver metrics, telemetry, and alerting for data center infrastructure (network, storage, etc.). We are part of the Infrastructure Strategy Datacenter Operations organization, which is a dynamic, global team delivering and supporting technology infrastructure to meet the substantial growth needs of the business.
In this role, you will leverage your experience in building and deploying large-scale systems to automate systems services across all types of infrastructure (storage, network, server), enable the collection of infrastructure telemetry, make the infrastructure visible and accessible, and ensure that alerts are generated where action is needed.
Responsibilities
- Design, build, and operate large-scale observability systems that deliver metrics, telemetry, and alerting across data center infrastructure including network and storage environments
- Develop and maintain distributed services in Java and/or Python to enable automated collection of infrastructure telemetry at scale, ensuring full visibility into critical systems
- Build and deploy automation solutions using tools such as Ansible, Puppet, or Chef to streamline infrastructure services across storage, network, and server environments
- Publish and consume REST APIs to integrate telemetry pipelines and expose infrastructure data to downstream systems and stakeholders
- Drive alerting frameworks that surface actionable signals from infrastructure telemetry, reducing noise and ensuring the right teams are notified when intervention is needed
- Partner with a global, cross-functional Infrastructure Strategy Datacenter Operations team to support rapid growth, leveraging CI/CD practices (Jenkins), source control (Git), and Linux (RedHat) expertise to deliver reliable, scalable observability tooling
- Build and ship high-quality, production-grade software using modern engineering practices, making AI a core part of your development workflow and pushing the boundaries of AI development tools to deliver secure, optimized, high-quality code
- Design and orchestrate complex systems where AI agents integrate seamlessly into human workflows, driving efficiency and innovation at scale.
- Contribute to building and maintaining the shared system context, an explicit repository of system designs, constraints, and standards that enables AI to operate accurately and reliably.
- Critically evaluate code (Human or AI-generated) for correctness, quality, security, and performance
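As a flavor of the telemetry-and-alerting work described above, here is a minimal, illustrative sketch of threshold-based alert evaluation over collected metrics. All device names, metric names, and thresholds are hypothetical examples, not part of any actual Salesforce system.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    """One telemetry sample from a piece of infrastructure (hypothetical shape)."""
    device: str
    name: str
    value: float

def evaluate_alerts(metrics, thresholds):
    """Return actionable alert strings for metrics whose value exceeds the
    configured threshold for that metric name; metrics without a configured
    threshold are ignored, which keeps the output low-noise."""
    alerts = []
    for m in metrics:
        limit = thresholds.get(m.name)
        if limit is not None and m.value > limit:
            alerts.append(f"{m.device}: {m.name}={m.value} exceeds {limit}")
    return alerts

# Example telemetry batch and alerting policy (made-up values).
metrics = [
    Metric("core-sw-1", "cpu_util", 97.0),
    Metric("core-sw-1", "link_errors", 0.0),
    Metric("storage-arr-2", "latency_ms", 42.0),
]
thresholds = {"cpu_util": 90.0, "latency_ms": 25.0}
print(evaluate_alerts(metrics, thresholds))
```

In practice this logic would sit behind the team's collection pipeline and notify the owning team; the sketch only shows the signal-versus-noise decision the bullet points describe.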
Requirements
- A related technical degree
- 8 years of proven experience supporting a codebase of distributed services implemented in Java and/or Python
- Experience with automation of systems services and processes.
- Excellent analytical and problem-solving skills
- A long-standing practice of using Source Control (e.g. git) and unit testing
- Experience in publishing and consuming REST APIs
- CI/CD experience with Jenkins
- Knowledge of Linux (RedHat) including configuration, packages, services, daemons, shells, and troubleshooting
- Experience with configuration automation tools such as Ansible, Puppet, and/or Chef.
- Experience in fast-paced, technical environments experiencing rapid growth and change
- Ability to adapt, to be flexible, and to learn quickly in a dynamic environment
- Excellent organizational skills including ability to prioritize tasks efficiently with high level of attention to detail
- Ability to work under tight deadlines while coordinating several projects at a time and responding to changing business and technical conditions
- A demonstrated, genuine AI-first approach to engineering: using AI to move faster, build fluency across the stack, and contribute well beyond your core specialty
- Experience using AI tools (e.g., Claude Code, GitHub Copilot, Codex, Cursor, etc.) in development workflows
- Advanced prompt engineering skills and the ability to write precise, structured prompts and cultivate the system context that makes AI outputs reliable, secure, and production-ready.
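One of the qualifications above is publishing and consuming REST APIs. As an illustrative, stdlib-only sketch, the following serves a single telemetry endpoint and reads it back; the `/metrics` path and payload are hypothetical examples, not a real Salesforce API.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical telemetry payload the server will publish.
TELEMETRY = {"device": "core-sw-1", "cpu_util": 42.0}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = json.dumps(TELEMETRY).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        # Silence per-request logging for the demo.
        pass

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Consume the same API as a client would.
url = f"http://127.0.0.1:{server.server_port}/metrics"
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read())
server.shutdown()
print(data)
```

A production service would of course add authentication, schema validation, and a real web framework; the sketch only demonstrates the publish/consume round trip.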
Preferred Qualifications
- Experience with monitoring and alerting for network infrastructure (routers, switches, load balancers, etc.) in a high-availability, always-on data center environment
- Experience with monitoring and alerting for storage infrastructure (switches, arrays, etc.) in a high-availability, always-on data center environment
- Experience with container orchestration systems such as Docker and Kubernetes
- Experience with Terraform, Helm, and Spinnaker.
- Strong network engineering skills: SNMP, BGP, OSPF or IS-IS, LAN switching technologies, backbone routing, load balancers, IPv4/IPv6 addressing and subnetting
- Experience with application protocols and troubleshooting them (e.g., HTTP, HTTPS, TCP/UDP)
- Experience with application databases and document stores, e.g. Elasticsearch, Cassandra
- Experience writing systems automation in a high-level language such as Python
- Experience building AI agents or LLM-powered tools for operational or infrastructure use cases (e.g., triage agents, anomaly detection, AI-assisted diagnosis)
- Hands-on experience with AI agent infrastructure, such as agent runtimes, tool/function-calling frameworks (e.g., MCP), secure execution environments, or context management for LLM-based systems
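The IPv4/IPv6 addressing and subnetting skills listed above can be exercised directly from Python's standard library; the addresses below are documentation/example ranges chosen for illustration.

```python
import ipaddress

# Carve a /22 into /24s and count its addresses.
net = ipaddress.ip_network("10.20.0.0/22")
print(net.num_addresses)          # 1024 addresses in a /22
subnets = list(net.subnets(new_prefix=24))
print([str(s) for s in subnets])  # four /24s carved from the /22

# IPv6 membership check against a documentation prefix.
v6 = ipaddress.ip_network("2001:db8::/32")
print(ipaddress.ip_address("2001:db8::1") in v6)  # True
```

The `ipaddress` module is handy for writing the kind of addressing automation and sanity checks this role calls for, without shelling out to network tooling.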