What are the responsibilities and job description for the Senior Site Reliability Engineer - AI/Automation Focus position at Aptino?
Job Overview:
We are looking for a Senior Site Reliability Engineer (SRE) with strong experience in production support, cloud infrastructure, and automation. This role focuses on managing and improving highly available systems while gradually introducing AI-driven automation to streamline operations and incident response.
This is a senior-level role, ideal for candidates who can handle production environments independently and improve system reliability through automation.
Key Responsibilities:
1. Production Support & Reliability
- Manage and support production systems in a cloud environment (Azure preferred)
- Participate in on-call rotation and handle high-priority incidents
- Perform root cause analysis and lead post-incident reviews
- Monitor systems using dashboards, alerts, SLIs, and SLOs
- Troubleshoot issues across Java applications, Kubernetes, and cloud infrastructure
- Work with cross-functional teams to improve system stability
2. Automation & AI-Driven Operations
- Build automation solutions to reduce manual operational work
- Develop AI-assisted workflows for incident detection, triage, and resolution
- Create tools to analyze logs, metrics, and system alerts
- Implement safe automation for tasks like restarts, scaling, and rollbacks
- Generate automated incident reports and communication summaries
Required Skills:
- 12 years of experience in SRE / Production Support / DevOps
- Strong experience with:
  - Azure Cloud
  - Kubernetes & Docker
  - Java-based applications
  - CI/CD (GitHub Actions or similar)
  - Monitoring tools (Dynatrace preferred)
- Experience with scripting/automation (Python, Bash, Ansible)
- Solid understanding of SRE concepts (SLI, SLO, error budgets)
Preferred Skills:
- Experience in automation of production workflows
- Exposure to AI/ML or AI-based automation tools
- Experience with multi-system or distributed environments
- Background in regulated industries (e.g., healthcare) is a plus
What We’re Looking For:
- Strong production support experience
- Ability to handle incidents independently
- Experience working in fast-paced environments
- Willingness to join the on-call rotation
- Comfortable supporting multiple time zones
Salary: $95 - $105