What are the responsibilities and job description for the Site Reliability Manager - Director - Hybrid position at Gravity IT Resources?
This is a direct-hire opportunity.
SRE Manager - Director
Global Technology and Data (GTD)
Location: Charlotte, NC - Hybrid onsite 3 days a week
Salary plus 25% bonus
Job Summary
We are seeking a highly skilled and motivated SRE Manager/Director to lead our global Site Reliability Engineering team. In this role, you will be responsible for building a high-performing team to manage the reliability, scalability, and performance of our cloud and on-premises infrastructure and services. This newly created role is an opportunity for the right leader to build a global team that can proactively support services as they transition to modern, cloud-based solutions.
Key Responsibilities:
- Lead the expansion of SRE practices from a small, high-performing team to a larger global function that incorporates on-premises infrastructure technologies.
- Evaluate current operational workflows and RACIs, identify toil, and complete a skills assessment across the global team.
- Execute a comprehensive roadmap to transition reactive day-to-day operational activities into proactive, SRE-aligned processes with a focus on reliability, automation, observability, and incident management.
- Upskill team members through tailored training programs on SRE principles, cloud operations and automation tools.
- Collaborate with architects, platform engineering, ServiceNow developers, and application teams to define and implement an observability framework that improves proactive incident detection and reduces MTTR.
- Define and implement an automation framework to ensure sustainable, responsible, and effective use of automation to reduce toil and risk.
- Define and regularly review SLIs, SLOs, SLAs, error budgets, and incident response processes.
- Oversee recruitment, orientation, and professional development of the global SRE team.
- Foster a high-performing team culture.
- Build strong relationships with internal and external stakeholders.
- Prepare and present reports on operational performance.
- Oversee incident response and post-incident analysis processes.
Key Requirements:
- Proven experience building and leading operational and engineering teams.
- Adept at fostering collaboration between SRE and application development teams to drive operational excellence, reduce downtime, and help application teams accelerate delivery cycles.
- Experience defining and monitoring SRE principles, including SLIs, SLOs, SLAs, error budgets, and incident response strategies.
- Experience overseeing incident response processes, with skill in post-incident analysis and in conducting blameless post-mortems across multiple teams to drive proactive measures that prevent future incidents.
- Experience spearheading automation initiatives using Terraform that significantly reduced infrastructure provisioning time.
- Experience with monitoring and observability tools such as LogicMonitor, Azure Monitor, Prometheus, Grafana, Dynatrace, and Splunk.
- Experience with ServiceNow and Azure DevOps, and a solid understanding of Agile, ITIL, and ITSM frameworks.
- Strong expertise in Azure technologies. Experience with other CSPs is highly beneficial.
- Proficiency in IaC tools including Terraform.
- Experience with SharePoint administration is highly beneficial.
- Experience with container orchestration.
- Strong scripting or programming skills (e.g., Python, PowerShell).
- Excellent communication skills.
- Experience managing other managers is highly beneficial.
Salary: $124,000 - $180,000