What are the responsibilities and job description for the Senior Java Site Reliability Engineer position at Ekfrazo Technologies Private Limited?
Role: Senior Java Site Reliability Engineer
Exp: 16–20 Years
Job Type: Contract
Project: Hybrid
Location: McLean, VA
Industry: Banking / Financial Services
Key Responsibilities
- Support and maintain highly available production platforms across cloud and distributed environments.
- Drive incident management, root cause analysis, problem management, and platform stability initiatives.
- Monitor and maintain uptime of Java applications and microservices.
- Proactively identify and resolve application performance bottlenecks.
- Conduct root cause analysis (RCA) for application outages and incidents.
- Implement resiliency patterns including circuit breakers, retries, and failover mechanisms.
- Lead reliability engineering efforts focused on system availability, performance optimization, and operational excellence.
- Implement and enhance observability solutions including monitoring, logging, alerting, and incident response automation.
- Collaborate with development, infrastructure, and cloud engineering teams to improve deployment reliability and operational efficiency.
- Support infrastructure modernization, cloud transformation, and platform automation initiatives.
- Coordinate disaster recovery testing, resiliency validation, capacity planning, and production readiness reviews.
- Provide technical leadership and mentor offshore/onshore engineering teams.
Required Experience
- 16–20 years of experience in Site Reliability Engineering (SRE), Production Engineering, Platform Engineering, or Application Support.
- Strong experience supporting large-scale enterprise production environments.
- Proven background in incident management, problem management, and operational support.
- Experience working within banking, financial services, fintech, or other highly regulated industries.
- Hands-on experience supporting mission-critical applications with stringent availability and performance requirements.
Required Skills
- Java
- Linux/Unix Administration
- Kubernetes and Container Platforms
- Docker
- Cloud Platforms (AWS, Azure, or GCP)
- CI/CD Tools (Jenkins, GitHub Actions, GitLab CI/CD, ArgoCD)
- Infrastructure as Code (Terraform, Ansible)
- Monitoring & Observability Tools (Splunk, Datadog, Grafana, Prometheus, Moogsoft)
- ServiceNow, JIRA, Confluence
- Python, Bash, or Shell Scripting
- SQL and Database Troubleshooting
- Application Performance Monitoring (APM)
- Production Release Management
- Disaster Recovery and High Availability Architectures
Education
- Bachelor's degree in Computer Science, Information Systems, Engineering, or a related technical discipline.