What are the responsibilities and job description for the Senior Site Reliability Engineer position at Insight Global?
Insight Global is hiring a Senior SRE for our client's applied AI and data science program. You’ll help deploy, monitor, support, and optimize software solutions that expand custom experimentation and configuration management platforms. You’ll work with cloud-based applications and resources hosted in AWS. The applications and services are written in Node.js, React, and Python, the infrastructure is managed with Terraform, and GitLab is used to manage the deployment pipelines. As a member of the CMX team, you’ll be part of a supportive, cross-functional team that values mentorship, continuous learning, and shared success. You’ll work closely with software engineers, test engineers, and DevOps engineers to keep these mission-critical solutions highly available and performant.
The CMX team is responsible for enhancing and supporting custom, enterprise-level experimentation and configuration management platforms. We facilitate experimentation and configuration management through web-based UIs and supporting services that enable users to draft and deploy A/B tests, progressive deployments, and JSON configuration changes. We are committed to building a team with diverse experience where everyone can bring their full self to the team.
Responsibilities:
· Release Management
o Build and deploy application, service, and infrastructure releases
o Validate system operation and integrity post-deployment
o Document release notes
· Production Support
o Maintain 99.999% availability of critical production systems
o Ensure infrastructure and applications run smoothly
o Keep infrastructure resources updated
o Serve on call for production outage and incident response
o Perform root cause analysis for all production outages and incidents
· Monitoring and Alerting
o Apply monitoring and alerting policies to every system and provide recommendations
o Build and expand monitoring dashboards
o Monitor and log errors, bugs, and unexpected behavior
o Monitor system efficiency including latency and resource consumption
o Monitor for system degradation, and pre-empt or mitigate it where possible
o Alert the appropriate people when monitoring detects operations outside of expected SLAs
· Optimization
o Manage resource scaling to align with project goals
o Optimize resource usage and system behavior
· Team Participation
o Assist with user support
o Become the resident expert on the system architecture, deployment pipelines, and resource utilization
o Coordinate efforts with onshore and offshore teammates
o Develop bug fixes
We’re looking for candidates with experience in the following areas. If you meet many of these qualifications and are excited about the role, we encourage you to apply—even if you don’t meet every requirement.
Primary Qualifications:
· Expertise using monitoring tools such as DataDog and/or Splunk
· Experience with AWS infrastructure and services (e.g. EKS, S3, DocumentDB)
· Experience working with containerized microservice and web-based applications
· Proficiency with the AWS console and Infrastructure as Code (e.g. Terraform)
· Experience benchmarking and performance testing
· Experience deploying cloud-based applications
· Familiarity with Git-based source control and branch management (e.g. GitLab)
· Bachelor’s degree in a related field or equivalent experience
Secondary Qualifications:
· 6 or more years of professional experience with the software development lifecycle
· Familiarity with Python, Node.js, React, TypeScript, and GraphQL
· Exposure to relational (SQL) and NoSQL databases
· Familiarity with Docker, Kubernetes, Redis, and an ORM built over a relational database
· Experience with experimentation, statistical testing, and data analysis
· Master’s degree or higher in a related field