What are the responsibilities and job description for the Site Reliability Engineer position at Apt?
Site Reliability Engineer
Location: Birmingham, AL
Type: Direct hire/full time
Day to day responsibilities:
- Working with application development teams to architect and deploy new cloud-based solutions on Google Cloud.
- Responsible for improving system reliability and resilience.
- This role focuses on building automation to reduce manual effort and prevent service-impacting incidents.
- The SRE combines software and systems engineering to build and support large-scale, distributed, fault-tolerant systems. This role ensures that critical platforms are available, reliable and able to support a fast rate of improvement.
- This role relies on monitoring platforms to maintain a continuous, holistic view of system health and performance.
- The SRE will enhance and support cloud-based transformations, with a focus on pushing capabilities forward, staying ahead of customer needs, and innovating for continuous improvement.
- The SRE provides operational support and engineering for multiple large-scale distributed software applications.
What you need to have:
- Bachelor's degree or equivalent experience
- 5-7 years of experience (minimum of 3 years in an SRE position preferred)
- Strong Google Kubernetes Engine (GKE) experience
- Proficient in Kubernetes, SRE principles, and cloud services (GCP).
- Experience with Infrastructure as Code (IaC) using GitHub and Terraform
- Experience with Dynatrace, New Relic, or SolarWinds
- Skilled in microservice architecture and infrastructure troubleshooting.
- Experienced in deploying, monitoring, and supporting enterprise applications.
- Proficient in CI/CD tools and performance optimization.
- Knowledge of web technologies and tools such as Azure DevOps, Dynatrace, Prometheus, Terraform, and Grafana.
Nice to have:
- Grafana
- Splunk