What are the responsibilities and job description for the Mid-level Site Reliability Engineer (SRE) position at Catapult Solutions Group?
12-month contract for a Mid-level Site Reliability Engineer (SRE) in Richardson, TX (onsite mandatory: 3 days per week; days vary, so candidates should be flexible)
***We can ONLY consider candidates local to Richardson, TX for this role***
As a Site Reliability Engineer supporting backend services for a large-scale SaaS collaboration platform, you will play a critical role in ensuring the reliability, scalability, and resilience of services used by millions of users globally. This role focuses on operational excellence, automation, and continuous improvement across cloud and hybrid environments.
Responsibilities:
- Own deployment, operation, and reliability of critical collaboration services across cloud and hybrid environments
- Participate in on-call rotations for production systems, respond to alerts and incidents, and drive timely mitigation and resolution
- Lead and contribute to production incident response, including triage, mitigation, root cause analysis, and post-incident reviews
- Design, enhance, and operate CI/CD pipelines and automation frameworks to improve deployment safety, reliability, and recovery
- Use standard alerting, monitoring, and observability tooling to detect issues, reduce mean time to recovery, and improve service health
- Develop and maintain runbooks, escalation procedures, and operational documentation to support reliable production operations
- Leverage observability and operational data to support capacity planning, scaling decisions, and resource optimization
- Establish and promote operational best practices and a culture of reliability, accountability, and continuous improvement
Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
- 3-5 years of experience in Site Reliability Engineering, Cloud Operations, DevOps, or a related role
- Hands-on experience operating production services in cloud or hybrid environments
- Production experience with containerized workloads using Docker and Kubernetes, including deployments, troubleshooting, and scaling
- Experience participating in on-call rotations and responding to real-world production incidents with urgency and initiative
- Proficiency in one or more scripting or programming languages such as Python, Go, or Bash for automation and operational tooling
- Experience building or maintaining CI/CD pipelines and supporting production deployments and rollbacks
- Experience using Infrastructure as Code tools such as Terraform or Ansible to manage production infrastructure
- Solid understanding of Linux systems, networking, distributed systems, Git-based development workflows, and infrastructure operations
- Experience operating workloads on at least one major cloud platform such as AWS, Azure, or GCP, including core services like IAM, networking, and compute
Required Tools and Technologies:
- Docker
- Kubernetes
- Linux
- Python
- Go
- Bash
- CI/CD platforms
- Git-based version control
- Cloud platforms including AWS, Azure, or GCP
- Infrastructure as Code tooling
- Monitoring and observability tools such as Prometheus, Grafana, Datadog, or CloudWatch
- Incident management and alerting tools such as PagerDuty or equivalent
Salary: $40 - $42