What are the responsibilities and job description for the Site Reliability Engineering Manager position at Trustech?
NOTE: Candidates requiring sponsorship now or in the future (including CPT/OPT) cannot be considered for this job.
No corp-to-corp (C2C) arrangements.
Candidates will be required to work onsite 3 days per week in south Salt Lake County.
Onsite interviews are required. Local candidates only.
SRE/Platform Engineering Manager
Overview
We are seeking a hands-on Platform Engineering/SRE Manager to lead a small, high-impact team responsible for maintaining and improving the reliability, performance, and scalability of our production systems. This role blends technical leadership and operational excellence, managing a group of Site Reliability and Platform Engineers who ensure our applications and infrastructure run smoothly in production.
The ideal candidate is a player-coach who is comfortable leading incident response and mentoring engineers while still contributing technically through infrastructure automation, observability improvements, and system reliability enhancements.
Key Responsibilities
- Lead and mentor a team of SREs and Platform Engineers (currently five members) focused on production stability, system automation, and operational readiness.
- Own the reliability lifecycle, driving proactive monitoring, on-call response leadership, and post-incident reviews to minimize downtime and improve service quality.
- Develop and evolve infrastructure automation using Terraform, Helm, and related Infrastructure-as-Code practices to standardize deployments and reduce manual interventions.
- Partner with product, software, and operations teams to implement scalable cloud solutions that meet performance and resiliency targets.
- Oversee observability and telemetry using tools like Grafana, Azure Insights, Datadog, or Dynatrace, ensuring comprehensive visibility into system health.
- Drive the definition and tracking of SLOs, SLIs, and SLAs, helping teams measure and continuously improve reliability standards.
- Collaborate with engineering leads to enhance developer platform capabilities, such as workflow automation, CI/CD pipeline management, and simplified environment provisioning.
What You’ll Bring
- Bachelor’s degree in Computer Science, Information Technology, or equivalent practical experience.
- 7 years of experience in infrastructure, SRE, or platform engineering roles, including 3 years in leadership or team management.
- Strong background in cloud infrastructure (AWS, Azure, or GCP) and hands-on experience with IaC tools such as Terraform.
- Familiarity with CI/CD pipelines, container orchestration, and deployment frameworks (e.g., Jenkins, GitHub Actions, Kubernetes, Docker).
- Experience improving system observability, developing dashboards, and managing alerting systems using Grafana or similar platforms.
- Competence in Python, Go, or C# for automation and troubleshooting.
- Solid understanding of relational databases (SQL) and the ability to guide teams in identifying and resolving performance bottlenecks.
- Demonstrated ability to lead incident management, communicate effectively across teams, and create a culture of continuous improvement.
Preferred Experience
- Experience with developer enablement or internal platform engineering initiatives (e.g., self-service infrastructure or environment provisioning).
- Familiarity with data-driven operational metrics and applying analytics to improve system reliability.
- Prior experience managing a hybrid or remote technical team across time zones.
Work Style
- Approximately 30% hands-on technical contribution and 70% team leadership, process improvement, and coordination.
- Availability to participate in daytime and occasional off-hours on-call support rotations.
- Commitment to building a proactive, reliability-first culture that values automation, transparency, and cross-functional collaboration.