What are the responsibilities and job description for the SRE Lead Engineer position at Litmus7?
Role Summary
- We are looking for a hands-on Lead SRE Engineer to work onsite and own production reliability, observability, incident response, and operational improvements for enterprise-scale ecommerce systems.
- The candidate should be technically strong, able to lead P1/P2 incident triage, work directly with client stakeholders, guide offshore teams, and drive improvements across monitoring, alerting, dashboards, runbooks, and automation.
Key Responsibilities
- Lead onsite production triage for critical incidents and coordinate with application, infrastructure, DevOps, database, network, and offshore teams.
- Monitor and support business-critical ecommerce flows such as checkout, order capture, payment, inventory, promotions, and fulfilment integrations.
- Use Dynatrace and Splunk to analyse logs, metrics, traces, service health, latency, failure rates, and downstream dependencies.
- Build and maintain dashboards for SRE operations, service owners, and leadership visibility.
- Improve alerting by reducing noise, defining meaningful thresholds, and aligning alerts with customer impact and SLOs.
- Drive root cause analysis, post-incident reviews, corrective actions, and preventive improvements.
- Create and maintain runbooks, SOPs, troubleshooting guides, and operational playbooks.
- Identify automation and AI-assisted triage opportunities to improve incident response and operational efficiency.
- Mentor SRE/support engineers and ensure smooth onsite-offshore coordination and handovers.
- Communicate incident status, business impact, risks, and next steps clearly to client stakeholders.
Required Skills
- 8 years of experience in SRE, production support, DevOps, platform engineering, or application operations.
- Strong hands-on experience with Dynatrace and Splunk.
- Good understanding of microservices, APIs, distributed systems, Kubernetes, containers, and cloud platforms.
- Experience supporting high-volume ecommerce or enterprise production systems.
- Strong knowledge of incident management, root cause analysis, monitoring, alerting, and SLO/SLA practices.
- Ability to analyse application performance issues including latency, throughput, error rates, pod restarts, CPU/memory, database latency, and third-party dependency issues.
- Strong communication skills with the ability to explain technical issues to both engineering and leadership teams.
- Experience leading onsite-offshore coordination and mentoring engineers.
Preferred Skills
- Retail or ecommerce domain experience.
- Experience with order capture, checkout, payment, inventory, or OMS flows.
- Knowledge of Dynatrace DQL, Grail, Smartscape, Davis AI, OpenPipeline, and SLOs.
- Experience with ServiceNow, Jira, PagerDuty, Teams, or similar incident-management integrations.
- Scripting or automation experience using Python, shell scripting, or similar tools.
- Exposure to AI-assisted triage, self-healing, or runbook automation.