What are the responsibilities and job description for the Senior Site Reliability Engineer position at Todyl?

Senior Site Reliability Engineer

About the Role

The Site Reliability Engineering team at Todyl exists to make our platform reliable, secure, and easy for engineering teams to ship to. We do that by building automation, self-service tooling, and operational standards that let developers move fast without putting customers at risk.

Our success is measured by how much production reliability and developer velocity we enable, not by how much work flows through us. This is a senior individual contributor role. You'll own end-to-end design and delivery of the Kubernetes-based platform initiatives that shape how Todyl runs production over the next 2–3 years, mentor and uplevel the rest of the SRE team, and operate as a peer to Architecture and Security on high-stakes platform decisions.

The team is small and rebuilding after recent transitions, and you'll work alongside our Principal SRE as one of the senior anchors of the function.

In this role, we're looking for someone who:

Has 5 years of Site Reliability Engineering or platform-engineering experience and has owned major platform initiatives end-to-end, from design through stabilization, staying with the work until it's truly done rather than declaring victory at deploy. They're recognized as the go-to person in their technical domain and create design documentation that their teams reference long after the work ships.
Mentors less-tenured engineers as a matter of practice. They grow the people around them through pairing, design partnership, and the example they set.
Sees SRE as a service to the engineering organization, not a gate. They build trust with developers and make other teams' jobs easier.
Treats security as a normal part of operating the platform, not an afterthought, and brings demonstrated experience designing systems with security as a first-class concern.
Gets energized by eliminating toil and looks at repetitive work and asks, "How do we make this go away?"
Actively uses AI tooling in their day-to-day work, and influences how the team adopts AI patterns safely.
Can communicate technical decisions clearly to engineers, engineering leadership, and non-engineering stakeholders, and is comfortable saying no or pushing back constructively when it matters.

What you'll do:

Own end-to-end design and delivery of flagship platform initiatives, designing for failure modes, graceful degradation, and the scale we expect 12 months from now rather than just today. The headline 12–18 month deliverable for this role is the golden-path platform: a developer-facing self-service path to production that enforces infrastructure best practices without requiring SRE involvement.
Drive security automation at platform scale, including patching cadence, secret rotation, access controls, and CVE remediation, as ongoing operational practices rather than reactive sprints.
Partner with product engineering teams at the architecture phase of high-stakes systems, helping shape the design rather than reviewing it the week before launch.
Operate as a peer to Architecture and Security on platform decisions that affect how Todyl runs production over the next 2–3 years.
Mentor less-tenured SREs through pairing, code review, and design partnership, with measurable improvement in their autonomy on design and incident work.
Contribute to one or more SRE practice improvements adopted by the team: incident commander discipline, postmortem maturity, change management standards, on-call quality, or design review cadence.
Build and operate the production platform: Kubernetes with Helm and ArgoCD, CI/CD pipelines, infrastructure-as-code (Terraform, Salt), observability (Grafana, Prometheus), secrets management, and AWS (including EKS). We're shifting from reactive to proactive, and we'd rather build guardrails than approve every deploy.
Drive cost visibility and efficiency across our cloud footprint, including AWS resource tagging, COGS attribution, and right-sizing across the platform, and quantify the business impact in terms that leadership can act on.
Participate in a weekly on-call rotation, resolve most issues independently, and own postmortems and follow-up actions for the incidents you respond to.
Plan and estimate honestly, break multi-quarter work into smaller increments, communicate delays early, and write tests for the automation you build because it runs in production.
Treat code review as a quality lever, not a checkbox. Catch missing tests, push back on tech debt, and watch dashboards and logs to verify your own changes after they ship.
When something you've built is mature and stable, look for ways to hand it off or make it self-managing rather than holding onto it forever.

Important note: We expect the person in this role to actively use AI tools, including tools like Claude, to accelerate automation development, reduce toil, and solve infrastructure problems more quickly. At the senior level, we also expect you to influence how the team adopts AI tooling: sharing patterns that work, flagging patterns that don't, and helping the team integrate AI safely into review, incident response, and automation workflows.

As part of our interview process, you'll work through a live AI-paired exercise with a couple of our engineers to see how you approach a real platform problem together.

We don't expect deep knowledge across every item below, but familiarity with several of these will help you ramp quickly.

Most importantly, we're looking for a strong technical background, the willingness to learn what you don't already know, and demonstrated experience operating production platforms at meaningful scale.

Kubernetes (EKS), Helm, ArgoCD, containerization
AWS (including EKS, ECR, and IAM) and cloud-native infrastructure
Infrastructure-as-code (Terraform, Salt)
CI/CD pipelines and GitOps (GitHub Actions, ArgoCD)
Observability stack (Grafana, Prometheus)
Linux at scale
Python or Bash for tooling
Networking fundamentals
Security-conscious infrastructure design (patching, secrets management, access controls)
Git and modern development workflows

Compensation Range: $165K - $185K
