What are the responsibilities and job description for the Site Reliability Lead Engineer (SRE) position at SumasEdge Corporation?
Required Skills
- 12 years of Software Engineering experience
- 4 years in Site Reliability Engineering
- Strong experience with APM / monitoring tools (Splunk, ELK, Grafana, Prometheus)
- Experience with distributed systems, relational & NoSQL databases
- Knowledge of Redis, Memcached, MQ, Kafka
- Hands-on Shell scripting and Ansible (YAML)
- Experience with CI/CD tools (Git, Jenkins, UCD or similar)
- Experience with Kubernetes / OpenShift, PCF, AWS or Azure
- Tech stack: Java/J2EE, Spring Boot, Python, Kafka, Oracle, MongoDB
Responsibilities
- Enhance platform reliability, performance, and observability
- Build dashboards and alerts using APM tools (Splunk, ELK, Grafana, Prometheus, GCL)
- Proactively identify performance bottlenecks and system risks
- Support incident management and root cause analysis
- Collaborate with Engineering, Security, Networking, and Infrastructure teams
- Automate operational tasks using Shell scripting and DevOps tools
- Support CI/CD pipelines and release processes