What are the responsibilities and job description for the Site Reliability Engineer position at PANGEATWO?
A growing enterprise technology organization is seeking an experienced Site Reliability Engineer (SRE) to support large-scale distributed applications and cloud-based transformation initiatives. This is a high-impact role focused on improving system reliability, scalability, automation, and operational resilience across mission-critical platforms.
The ideal candidate will bring a strong combination of software engineering, infrastructure, cloud, and operational support experience with a passion for automation, monitoring, and continuous improvement.
Candidates must be authorized to work in the U.S. without sponsorship.
Key Responsibilities
- Monitor system performance, availability, and reliability across enterprise platforms
- Gather and analyze operational metrics to improve fault tolerance and scalability
- Partner closely with development teams to improve deployment, testing, and release processes
- Support cloud-based transformation initiatives and platform modernization efforts
- Participate in system architecture, platform management, and capacity planning activities
- Drive automation initiatives to reduce manual intervention and improve system resilience
- Troubleshoot infrastructure, application, network, and performance-related issues
- Respond to incidents and support restoration of critical services
- Implement proactive monitoring, alerting, and observability improvements
- Support continuous improvement initiatives focused on performance, reliability, and operational excellence
Qualifications
- Bachelor’s degree or equivalent experience
- 5 years of experience in Site Reliability Engineering, DevOps, Systems Engineering, or related areas
- Strong understanding of Kubernetes, containers, clustering, and elastic scalability
- Experience supporting cloud environments, preferably Google Cloud Platform (GCP)
- Experience with microservices and API/service-based architectures
- Strong troubleshooting skills across infrastructure, networking, databases, operating systems, and security
- Architecture-level knowledge of Linux and Windows systems
- Experience with CI/CD pipelines and deployment automation
- Experience supporting enterprise production environments and monitoring platforms
Experience with the following technologies is highly preferred:
- Kubernetes
- Google Cloud Platform (GCP)
- Terraform
- Prometheus
- Grafana
- Dynatrace
- Azure DevOps (ADO)
- CI/CD tools and automation frameworks
- HTTP, proxies, and modern web technologies
Ideal Candidate
- Passionate about scalability, stability, and system performance
- Proactive problem-solver with strong analytical abilities
- Comfortable working in fast-paced, high-availability environments
- Strong collaborator who works effectively across engineering and operations teams
- Focused on automation, innovation, and continuous improvement
This is an excellent opportunity to join a forward-thinking technology environment where reliability engineering plays a strategic role in platform growth and modernization.