What are the responsibilities and job description for the Site Reliability Engineering (SRE) Lead position at Gotham Technology Group?
SRE Lead
Hybrid – 2-3 days onsite, contract-to-hire role
Our direct client is seeking an SRE Lead to help define and scale observability and reliability capabilities across an enterprise environment.
This is a hands-on leadership role where you will shape observability strategy, build scalable monitoring solutions, and drive adoption across infrastructure, application, and platform teams.
Some key highlights:
- Opportunity to drive a critical reliability and observability function across the organization
- High visibility role with influence across engineering and architecture teams
- Strong investment in cloud, platform engineering, and modernization initiatives
- Competitive compensation and benefits
What You’ll Do
- Lead the design and implementation of observability and monitoring solutions across cloud, on-prem, and hybrid environments
- Define and drive standards and best practices for reliability, monitoring, and telemetry
- Build and scale telemetry pipelines for metrics, logs, and traces
- Implement modern observability frameworks such as OpenTelemetry and eBPF
- Partner with infrastructure, application, security, and data teams to embed observability into system design
- Establish governance around the telemetry lifecycle, including data retention, granularity, and cost optimization
- Evaluate, implement, and optimize tools such as Prometheus, Grafana, the ELK stack, and Azure Monitor
- Act as a technical leader, influencing architecture decisions and driving adoption across engineering teams
Qualifications
- Experience in SRE, observability, infrastructure engineering, and/or DevOps environments
- Strong hands-on experience with observability and monitoring tools such as Prometheus, Grafana, the ELK stack, and Azure Monitor
- Experience with OpenTelemetry, eBPF, and modern telemetry standards
- Proven experience building or improving observability platforms in enterprise environments
- Strong understanding of cloud platforms (Azure preferred), networking, and distributed systems
- Experience with infrastructure-as-code tools such as Terraform or Ansible, and with CI/CD pipelines
- Exposure to Kubernetes or other containerized environments
- Strong architectural and problem-solving skills with the ability to design scalable solutions
- Excellent communication skills and the ability to work across multiple teams and stakeholders
Nice to Have
- Experience with tools such as SolarWinds, OpsRamp, or ExtraHop
- Experience in large-scale or regulated environments
- Prior experience leading initiatives or mentoring engineers