What are the responsibilities and job description for the DevOps & Site Reliability Lead - Retail DevOps position at K&K Global Talent Solutions?
Job Description
Must-Have Technical/Functional Skills
Technology and Programming (Expert Level)
- Strong proficiency in Java full-stack development
- Object-oriented programming principles and concepts
- Hands-on experience with Spring Framework (Spring Boot, Spring MVC, Spring Security)
- Knowledge of RESTful API development
- Experience with databases such as Oracle, DB2, and MySQL
- Proficiency in the BASE24 EPS payment switch, C, AS400, and Python is an added advantage
Domain, Cloud & Platform Engineering
- Must have domain experience in retail point of sale (POS), payment systems, merchandising, inventory, or logistics
- Expertise in Microsoft Azure, including:
  - Compute (VMs, App Services, Azure Container Apps)
  - Containers & Orchestration (AKS, Docker)
  - Networking (VNETs, Private Endpoints, Application Gateway, Load Balancers)
  - Storage, Azure Key Vault, Azure Monitor, Log Analytics
- Proven experience designing enterprise‑grade, highly available cloud platforms
DevOps & Engineering Excellence
- Advanced experience with Azure DevOps and CI/CD pipeline architecture
- Strong scripting skills (PowerShell, Bash)
- GitOps concepts, branching strategies, release orchestration
Site Reliability Engineering (Leadership Level)
- Ownership of platform reliability, resiliency, and performance
- Definition and governance of:
  - SLIs, SLOs, SLAs
  - Error budgets and reliability metrics
- Advanced observability strategy, design, and implementation:
  - Metrics, logs, traces, alerts, and dashboards using Dynatrace
- Incident response leadership, RCA facilitation, and long‑term remediation planning
- Experience operating systems with 99.9%–99.99% availability
Security, Compliance & Cost
- Secure cloud design using Key Vault, managed identities, RBAC
- Cost optimization (FinOps mindset) across cloud infrastructure
Roles & Responsibilities
- Act as Lead SRE for the client's retail platforms, owning reliability and stability outcomes
- Define and enforce SRE standards, best practices, and operating models
- Architect and govern highly available, scalable cloud platforms
- Lead the design and implementation of CI/CD and IaC strategies
- Establish proactive monitoring, alerting, and incident prevention mechanisms
- Own major incident leadership, RCA execution, and corrective action tracking
- Partner with application, security, and architecture teams to build in reliability by design
- Drive automation to reduce toil and improve operational efficiency
- Mentor and coach SRE and DevOps engineers across teams
- Influence roadmap decisions with a reliability, scalability, and cost lens
Salary: $120,000 - $160,000