What are the responsibilities and job description for the Site Reliability Engineering Manager position at TBG | The Bachrach Group?
A leading global investment management firm is seeking an exceptional SRE Manager to lead and transform its Site Reliability Engineering function. This is a newly created leadership role offering a unique opportunity to build and scale a world-class global SRE team while driving the organization's evolution toward modern, cloud-based infrastructure solutions.
The Role
As SRE Manager, you will lead the strategic expansion of site reliability practices across a global organization, transforming operational workflows into proactive, automation-driven processes. You'll build and mentor a high-performing team responsible for ensuring the reliability, scalability, and performance of both cloud and on-premises infrastructure and services.
Key Responsibilities
Team Leadership & Development
- Lead the growth and development of a global SRE team from a small, high-performing unit to a comprehensive function
- Oversee recruitment, onboarding, and professional development initiatives
- Design and deliver tailored training programs on SRE principles, cloud operations, and automation tools
- Foster a culture of excellence, collaboration, and continuous improvement
Strategic Transformation
- Evaluate current operational workflows, RACIs, and skill assessments across the global team
- Execute a comprehensive roadmap to transition reactive operations into proactive, SRE-aligned processes
- Identify and eliminate toil through automation and process optimization
- Define and implement sustainable automation frameworks to reduce operational risk
Technical Excellence
- Collaborate with architects, platform engineering, ServiceNow developers, and application teams to design and implement comprehensive observability frameworks
- Define, monitor, and regularly review SLIs, SLOs, SLAs, and error budgets
- Enhance proactive incident detection capabilities and reduce MTTR
- Oversee incident response processes and champion a blameless post-mortem culture across teams
Stakeholder Management
- Build strong partnerships with internal and external stakeholders
- Prepare and present operational performance reports to leadership
- Drive alignment between SRE and application development teams
Required Qualifications
Leadership Experience
- Proven track record of building and leading operational and engineering teams
- Demonstrated ability to foster collaboration between SRE and development teams
- Experience driving operational excellence and reducing downtime while accelerating delivery cycles
SRE & Incident Management Expertise
- Strong experience applying SRE principles, including defining and monitoring SLIs, SLOs, SLAs, and error budgets
- Skilled in incident response, post-incident analysis, and facilitating blameless post-mortems
- Track record of implementing proactive measures to prevent recurring incidents
Technical Proficiency
- Deep expertise in Azure technologies (experience with other cloud providers highly beneficial)
- Proven experience with Infrastructure as Code tools, particularly Terraform
- Hands-on experience with monitoring and observability tools such as LogicMonitor, Azure Monitor, Prometheus, Grafana, Dynatrace, and Splunk
- Strong scripting or programming skills (Python, PowerShell)
- Experience with ServiceNow and Azure DevOps
- Understanding of container orchestration platforms
- Solid grasp of Agile, ITIL, and ITSM frameworks
Preferred Qualifications
- Experience managing other managers
- SharePoint administration experience
- Demonstrated success spearheading automation initiatives that significantly reduced infrastructure provisioning time
Soft Skills
- Excellent communication and presentation abilities
- Strong stakeholder management capabilities
- Strategic thinking with hands-on execution ability
Why This Role?
This is a rare opportunity to shape the SRE function of a global organization from the ground up. You'll have the autonomy to build processes, develop talent, and drive meaningful transformation that directly impacts business outcomes. If you're passionate about reliability engineering, team development, and driving operational excellence at scale, this role offers the platform to make a significant impact.
Salary: $155,000 - $180,000