What are the responsibilities and job description for the Site Reliability Engineer position at TPA technologies?
This is a full-time opportunity with three days per week onsite in Webster, MA.
The Site Reliability Engineer (SRE) is a critical part of our client's USA on-prem and cloud platform strategy. In this role, you will ensure that the development platform and processes let our software engineers focus more on innovation than on infrastructure. You will drive the adoption of observability best practices and develop automations for resolving recurring issues. You must be comfortable working with software engineering teams and supporting their demanding needs to ensure the security, availability, and performance of the platform. You must be capable of triaging issues on the front line as well as framing strategic initiatives from leadership. Hands-on keyboard work is a must for this role, with a focus on developing reliability engineering for the client's platforms.
- You will set standards for monitoring the client's on-prem and cloud infrastructure and applications.
- You will ensure the platform target SLAs are met and implement appropriate SLIs for supporting services.
- As a key member of the Critical Incident Response team, you will use expert communication and troubleshooting skills to help the team resolve incidents efficiently.
- You will work with developers during service transition, evaluating reliability and operability of the applications and ensuring adequate monitoring, alerting and observability.
- You will partner with peers within Operations & Infrastructure to support ongoing maintenance and enhancement of the platform.
- 5 or more years of work experience with a bachelor’s degree, or 4 or more years of relevant experience with an advanced degree. Master’s degree in IT, CS, or a related field preferred and/or 5 years of relevant work experience.
- Hands-on experience with Linux and Windows systems and a good understanding of distributed computing environments.
- Intermediate-level programming and/or scripting in 3 or more of the following: Python, PowerShell, JavaScript, Terraform, Ansible, etc.
- 2 years of experience managing CI/CD tooling such as Jenkins, GitHub, Bitbucket, DevOps in a large-scale environment.
- 3 years of experience managing observability tooling such as Splunk, Dynatrace, etc. in a large-scale environment.
- Advanced understanding of YAML, JSON, HTML, XML.
- 2 years of work experience supporting relational and non-relational databases (MySQL, MongoDB, PostgreSQL, etc.), including creating and running queries and managing performance and scaling.
- 3 or more years of experience leading a Platform, SRE, or Production Engineering group for high-availability, critical platforms and applications.
- Experience managing a distributed platform, including but not limited to deployment/release management, provisioning, capacity management, and workload management.