What are the responsibilities and job description for the Lead Software Engineer (Observability & Telemetry) position at griddable.io?
Description
Join the team responsible for innovating and maintaining the massive-scale, distributed systems that monitor Salesforce’s infrastructure.
This position is located in the Bellevue office and requires onsite presence.
The Network Visibility and Telemetry team is responsible for designing, building, and operating a set of systems and services that deliver metrics, telemetry, and alerting for data center infrastructure (network, storage, etc.). We are part of the Infrastructure Strategy Datacenter Operations organization, which is a dynamic, global team delivering and supporting technology infrastructure to meet the substantial growth needs of the business.
In this role, you will leverage your experience in building and deploying large-scale systems to automate systems services across all types of infrastructure (storage, network, server), enable the collection of infrastructure telemetry, make the infrastructure visible and accessible, and ensure that alerts are generated where action is needed.
Responsibilities
- Design, build, and operate large-scale observability systems that deliver metrics, telemetry, and alerting across data center infrastructure including network and storage environments
- Develop and maintain distributed services in Java and/or Python to enable automated collection of infrastructure telemetry at scale, ensuring full visibility into critical systems
- Build and deploy automation solutions using tools such as Ansible, Puppet, or Chef to streamline infrastructure services across storage, network, and server environments
- Publish and consume REST APIs to integrate telemetry pipelines and expose infrastructure data to downstream systems and stakeholders
- Drive alerting frameworks that surface actionable signals from infrastructure telemetry, reducing noise and ensuring the right teams are notified when intervention is needed
- Partner with a global, cross-functional Infrastructure Strategy Datacenter Operations team to support rapid growth, leveraging CI/CD practices (Jenkins), source control (Git), and Linux (RedHat) expertise to deliver reliable, scalable observability tooling
- Build and ship high-quality, production-grade software using modern engineering practices, making AI a core part of your development workflow and pushing the boundaries of AI development tools to deliver secure, optimized, high-quality code
- Design and orchestrate complex systems where AI agents integrate seamlessly into human workflows, driving efficiency and innovation at scale.
- Contribute to building and maintaining the shared system context, an explicit repository of system designs, constraints, and standards that enables AI to operate accurately and reliably.
- Critically evaluate code (Human or AI-generated) for correctness, quality, security, and performance
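As a flavor of the telemetry-and-alerting work described above, here is a minimal, illustrative sketch of threshold-based alert evaluation over collected metrics. All device names, metric names, and thresholds are hypothetical examples, not part of any actual Salesforce system.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    """One telemetry sample from a piece of infrastructure (hypothetical shape)."""
    device: str
    name: str
    value: float

def evaluate_alerts(metrics, thresholds):
    """Return actionable alert strings for metrics whose value exceeds the
    configured threshold for that metric name; metrics without a configured
    threshold are ignored, which keeps the output low-noise."""
    alerts = []
    for m in metrics:
        limit = thresholds.get(m.name)
        if limit is not None and m.value > limit:
            alerts.append(f"{m.device}: {m.name}={m.value} exceeds {limit}")
    return alerts

# Example telemetry batch and alerting policy (made-up values).
metrics = [
    Metric("core-sw-1", "cpu_util", 97.0),
    Metric("core-sw-1", "link_errors", 0.0),
    Metric("storage-arr-2", "latency_ms", 42.0),
]
thresholds = {"cpu_util": 90.0, "latency_ms": 25.0}
print(evaluate_alerts(metrics, thresholds))
```

In practice this logic would sit behind the team's collection pipeline and notify the owning team; the sketch only shows the signal-versus-noise decision the bullet points describe.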
Requirements
- A related technical degree
- 8 years of proven experience supporting a codebase of distributed services implemented in Java and/or Python
- Experience with automation of systems services and processes.
- Excellent analytical and problem-solving skills
- A long-standing practice of using Source Control (e.g. git) and unit testing
- Experience in publishing and consuming REST APIs
- CI/CD experience with Jenkins
- Knowledge of Linux (RedHat) including configuration, packages, services, daemons, shells, and troubleshooting
- Experience with configuration automation tools such as Ansible, Puppet, and/or Chef.
- Experience in fast-paced, technical environments experiencing rapid growth and change
- Ability to adapt, to be flexible, and to learn quickly in a dynamic environment
- Excellent organizational skills including ability to prioritize tasks efficiently with high level of attention to detail
- Ability to work under tight deadlines while coordinating several projects at a time and responding to changing business and technical conditions
- A demonstrated, genuine AI-first approach to engineering: using AI to move faster, build fluency across the stack, and contribute well beyond your core specialty
- Experience using AI tools (e.g., Claude Code, GitHub Copilot, Codex, Cursor, etc.) in development workflows
- Advanced prompt engineering skills and the ability to write precise, structured prompts and cultivate the system context that makes AI outputs reliable, secure, and production-ready.
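One of the qualifications above is publishing and consuming REST APIs. As an illustrative, stdlib-only sketch, the following serves a single telemetry endpoint and reads it back; the `/metrics` path and payload are hypothetical examples, not a real Salesforce API.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical telemetry payload the server will publish.
TELEMETRY = {"device": "core-sw-1", "cpu_util": 42.0}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = json.dumps(TELEMETRY).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        # Silence per-request logging for the demo.
        pass

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Consume the same API as a client would.
url = f"http://127.0.0.1:{server.server_port}/metrics"
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read())
server.shutdown()
print(data)
```

A production service would of course add authentication, schema validation, and a real web framework; the sketch only demonstrates the publish/consume round trip.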
Preferred Qualifications
- Experience with monitoring and alerting for network infrastructure (routers, switches, load balancers, etc.) in a high-availability, always-on data center environment
- Experience with monitoring and alerting for storage infrastructure (switches, arrays, etc.) in a high-availability, always-on data center environment
- Experience with container orchestration systems such as Docker and Kubernetes
- Experience with Terraform, Helm, and Spinnaker.
- Strong network engineering skills: SNMP, BGP, OSPF or IS-IS, LAN switching technologies, backbone routing, load balancers, IPv4/IPv6 addressing and subnetting
- Experience with application protocols and troubleshooting them (e.g., HTTP, HTTPS, TCP/UDP)
- Experience with application databases and document stores, e.g. Elasticsearch, Cassandra
- Experience writing systems automation in a high-level language such as Python
- Experience building AI agents or LLM-powered tools for operational or infrastructure use cases (e.g., triage agents, anomaly detection, AI-assisted diagnosis)
- Hands-on experience with AI agent infrastructure, such as agent runtimes, tool/function-calling frameworks (e.g., MCP), secure execution environments, or context management for LLM-based systems
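The IPv4/IPv6 addressing and subnetting skills listed above can be exercised directly from Python's standard library; the addresses below are documentation/example ranges chosen for illustration.

```python
import ipaddress

# Carve a /22 into /24s and count its addresses.
net = ipaddress.ip_network("10.20.0.0/22")
print(net.num_addresses)          # 1024 addresses in a /22
subnets = list(net.subnets(new_prefix=24))
print([str(s) for s in subnets])  # four /24s carved from the /22

# IPv6 membership check against a documentation prefix.
v6 = ipaddress.ip_network("2001:db8::/32")
print(ipaddress.ip_address("2001:db8::1") in v6)  # True
```

The `ipaddress` module is handy for writing the kind of addressing automation and sanity checks this role calls for, without shelling out to network tooling.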