What are the responsibilities and job description for the Site Reliability Engineer position at AllSTEM Connections?
Job Title: Site Reliability Engineer (SRE)
Location - Seattle, WA
Pay - Depending on experience
1-year contract-to-full-time opportunity
Job Description
We are looking for a skilled Site Reliability Engineer (SRE) to join our team and support the reliability, scalability, and efficiency of our production systems. The ideal candidate will have strong experience in monitoring (Grafana), automation (Ansible), configuration management, and implementing best-in-class operational practices. You will work cross-functionally with development, infrastructure, and DevOps teams to automate processes, optimize system performance, and ensure seamless service availability.
Responsibilities
- Design, implement, and maintain monitoring dashboards using Grafana to ensure high system visibility and proactive issue detection.
- Develop and maintain automation scripts and playbooks using Ansible for deployment, configuration, and operational workflows.
- Manage and improve configuration management across distributed environments to ensure consistency, security, and compliance.
- Support CI/CD pipelines and automate repetitive tasks to enhance deployment speed and reliability.
- Troubleshoot production issues, perform root cause analysis, and drive long-term remediation.
- Collaborate with engineering teams to improve the reliability and performance of applications and infrastructure.
- Implement best practices for infrastructure as code, incident management, and operational excellence.
- Participate in on-call rotations as required.
Required Skills & Qualifications
- 10 years of experience in Site Reliability Engineering, DevOps, or Systems Engineering roles.
- Strong hands-on experience with Grafana for monitoring, metrics visualization, and alerting.
- Proficient in Ansible for automation and configuration management.
- Experience with CI/CD tools (Jenkins, GitLab CI, GitHub Actions, etc.).
- Familiarity with cloud environments (AWS, Azure, or GCP).
- Strong understanding of networking, load balancing, and distributed systems.
- Ability to automate manual processes and improve operational efficiency.
Preferred Skills
- Knowledge of Docker, Kubernetes, or other container orchestration tools.
- Experience with Terraform or other IaC tools.
- Background in incident management and post-mortem analysis.