What are the responsibilities and job description for the Senior Cloud Reliability Engineer position at Careers Integrated Resources Inc?
Job title:- Senior Cloud Reliability Engineer
Location:- Richmond, VA 23219
Duration:- 12 Months
Job Description:-
Qualifications:
- 5-7 years of extensive experience across the end-to-end enterprise software development life cycle, including maintenance and support
- 3 years of experience in Observability and SRE practices.
- 3 years of experience in the cloud networking domain (experience with routers, firewalls, load balancers, etc.)
- Bachelor’s degree in Computer Science, Information Systems, or an equivalent background or experience.
- Extensive knowledge of and experience working in AWS environments
- Knowledge of Azure is a plus
- Strong software development experience in the cloud with one of the following languages: Python or Go
- Experience with observability, OpenTelemetry, and one or more tools such as Dynatrace, Prometheus, Grafana, AWS CloudWatch, AWS Canary, and AWS EventBridge
- Expertise in automating toil
- Working experience in Agile and Scaled Agile environments
- Experience supporting infrastructure for large multi-services applications
- Knowledge of secure coding standards and banking environments is a plus.
- AWS certifications (AWS Certified Solutions Architect and AWS Certified SysOps Administrator) are desirable.
Responsibilities:
- As the Senior Cloud Reliability Engineer in the SRE Service, you will be accountable for implementing reliability practices, with software as the means, for the client's cloud foundational product line.
- The SRE Service is part of the Cloud Solutions & Services department and has overall responsibility for the reliability of the numerous cloud foundational environments in the FRS.
What Will Be Expected of You
- Work as part of cloud foundational platform squads focused on Cloud Networks to demonstrate and champion site reliability culture and practices, and exert technical influence throughout your team
- Develop and maintain automations, scripts, and code that automate manual work and improve the reliability and stability of the cloud platform
- Develop, integrate, and maintain synthetics (canary) code to establish the health of the services
- Lead SLI, SLO, and error budget efforts in collaboration with the product team, instrumenting and visualizing them to proactively manage the stability of cloud platforms
- Implement observability (logs, metrics, traces) and monitoring for Cloud Network components like VPC, VPN Tunnels, GWLB, and Transit Gateway using tools like SevOne, Grafana, Dynatrace, AWS CloudWatch and AWS Canary
- Respond to and resolve incidents in a timely manner
- Use Infrastructure as Code (IaC) tools like Terraform to manage AWS resources.
- Develop reusable artifacts and software utilities to industrialize SRE practices across the FRS
- Perform other duties as assigned