What are the responsibilities and job description for the Site Reliability Engineer position at Cyberobotix?
Job Title: Senior Site Reliability Engineer (SRE)
Location: Atlanta, GA (Hybrid)
Duration: Long Term
Work Mode: Hybrid, on a rotating schedule. Team A works from the office on Thursday, Friday, Monday, Tuesday, and Wednesday (Saturday and Sunday are off-days) while Team B works remotely during the same period. The following week, the schedule rotates: Team B works from the office and Team A works remotely.
Job Description:
Required Skillset
- Manage and optimize data streaming and API components in on-premises OpenShift and AWS.
- Proactively review the application’s APIs and processes to identify opportunities to optimize the response times for various application components.
- Automate various types of testing, including data quality checks, as well as delivery and deployment to production.
- Develop integrations between the application (on-premises and in AWS) and our third-party tools (ServiceNow, VersionOne, Sumo).
- Work with teams to create SLIs and SLOs.
- Actively monitor and lead troubleshooting of degraded performance and hard-to-define issues for the platform applications, develop the solution, and document artifacts from root cause analysis in the backlog.
- Evolve the cloud infrastructure ecosystem for our application suite by experimenting with emerging technologies and completing prototypes to understand their benefits.
- Design and develop CI/CD pipelines to deploy various application artifacts, including APIs and Data Process Jobs.
- Analyze, design, and develop the artifacts to configure monitoring and alerting metrics so the support engineers can validate, troubleshoot, and resolve issues proactively and in a timely manner.
- Maintain data integrity and access control by using AWS security tools and services such as HSM, IAM, etc.
- Understand and develop tools to monitor AWS billing for services, generate cost-related reports, and help develop and implement cost optimization strategies.
- Work with enterprise security architects to design and implement data security tools, measures, data encryption, and key management.
- Design and develop solutions to address security vulnerabilities discovered by internal security audit teams, vendors, and the security community.
- Design and develop solutions that let support teams regularly scan for, review, and fix security issues.
- Regularly and proactively monitor and analyze the capacity and performance of the platform.
- Work with the architecture team to design and implement elastic infrastructure to accommodate irregular bursts of user traffic/requests.
- Work with the architecture team to develop backup strategies and implement backup solutions for critical data and application components for service restoration and disaster recovery purposes.
- Work with architecture, infrastructure, and application teams to provide input on continuous improvement in design, performance, and security enhancements.
Desired Skillset
- Deep understanding of the operations of AWS cloud platforms.
- Must be well versed in automation, scripting, and monitoring, including use of platforms, tools, and languages such as OpenShift, CloudFormation, Terraform, Ansible, Shell, and Python.
- Candidates with significant technical knowledge across infrastructure layers are preferred, including:
  - Linux OS
  - Major virtualization platforms
  - Traditional and software-defined networking
  - Load balancers
  - Firewalls
  - API tools
  - Performance and intelligent monitoring tools
  - Storage
  - Backup strategy
- Significant knowledge and experience in end-to-end operations for enterprise systems and applications, including driving issue resolution for mission-critical systems.
- Must have experience working to automate, operationalize, and improve Development/QA using CI/CD tools (GitLab, GitHub, Jenkins, Maven, Gradle, Nexus).
- Working experience with Software Release Management.
Desired Qualification
- BS degree in Computer Science or a related technical field, or equivalent practical experience.
Minimum Experience
- 3 years of related DevOps or SysOps engineering experience with a focus on major cloud platforms (AWS preferred).
- 2 years of application development experience, including data streaming and deploying/monitoring high availability critical application components.
- 1 year in a Site Reliability Engineering (SRE) organization preferred.
- Overall 4–6 years of experience.