What are the responsibilities and job description for the Site Reliability Engineer position at Scale.jobs?
About The Role
The role focuses on the reliability, scalability, and performance of large-scale distributed systems. This involves balancing the development of new infrastructure features with the rigorous operational oversight required to maintain high availability across global production environments.
The team builds and manages the underlying platform that powers the developer experience. This includes automating manual processes, improving system observability, and refining incident response workflows to ensure the platform can scale alongside rapid user growth.
Key Responsibilities
- Design and implement Infrastructure-as-Code (IaC) using Terraform or Pulumi to manage multi-region cloud environments on AWS or GCP
- Develop and maintain Kubernetes-based container orchestration platforms, including service mesh configurations and custom controllers
- Build automated observability pipelines using Prometheus, Grafana, and Jaeger to monitor system health and provide deep tracing capabilities
- Drive incident response through on-call rotations, leading post-mortem analyses and implementing preventative measures to eliminate recurring failure modes
- Optimize CI/CD pipelines for speed and safety, ensuring high-frequency deployments do not compromise system stability or security
- Engineer automated self-healing mechanisms and scaling policies to handle unpredictable traffic spikes in production
Qualifications
- 3–7 years of experience in SRE, DevOps, or Infrastructure Engineering roles managing high-traffic production systems
- Expertise with containerization and orchestration tools, specifically Docker and Kubernetes (EKS, GKE, or self-managed)
- Strong proficiency in at least one backend language used for systems automation, such as Go, Python, or Rust
- Hands-on experience with cloud-native networking, including VPCs, load balancing, DNS, and CDN configurations
- Deep understanding of Linux internals, performance tuning, and troubleshooting distributed systems at scale
- Bonus: Experience with service mesh technologies (Istio/Linkerd), eBPF for profiling, or managing large-scale NoSQL databases