What are the responsibilities and job description for the Site Reliability Engineer position at Talener?
Job Title: Site Reliability Engineer (SRE)
Location: Remote (U.S.)
Overview
A fast-growing healthcare technology organization is seeking a Site Reliability Engineer (SRE) to help scale and support a high-impact cloud platform focused on improving healthcare delivery nationwide. This role will play a critical part in strengthening platform reliability, operational efficiency, observability, and automation across production environments.
The ideal candidate is passionate about infrastructure stability, incident response, automation, and continuous improvement within modern cloud-native environments.
Key Responsibilities
- Ensure the reliability, scalability, performance, and security of cloud-based infrastructure and applications
- Monitor, troubleshoot, and resolve production platform and application issues across distributed systems
- Lead incident response efforts, root cause analysis, and blameless post-mortems
- Build and maintain operational runbooks and automated remediation workflows
- Develop and enhance observability and telemetry solutions for proactive monitoring and alerting
- Collaborate closely with engineering, DevOps, QA, security, and operations teams to improve platform health and deployment processes
- Support infrastructure automation and configuration management initiatives
- Contribute to infrastructure-as-code (IaC) practices and CI/CD operational improvements
- Promote best practices around reliability engineering, incident management, and operational excellence
- Participate in an on-call rotation supporting production systems, including occasional off-hours support for West Coast operations
Qualifications
- 5 years of experience in Site Reliability Engineering, DevOps, Cloud Infrastructure, or related disciplines
- Strong experience troubleshooting and supporting production environments
- Hands-on experience with observability and monitoring platforms such as Datadog, New Relic, or similar tools
- Experience working within Azure-based cloud environments and modern containerized infrastructure
- Knowledge of Docker, Kubernetes, and cloud-native application hosting environments
- Experience with infrastructure-as-code tools such as Terraform, Terragrunt, or OpenTofu
- Strong scripting and automation experience using PowerShell, Python, JavaScript, or similar languages
- Experience with source control and CI/CD tooling (Git, Azure DevOps, etc.)
- Understanding of cloud security principles, compliance frameworks, and operational best practices
- Strong collaboration and communication skills within Agile engineering environments
- Experience improving operational visibility through telemetry, dashboards, reports, and alerting systems
- Experience evolving incident response processes and operational tooling
- Passion for mentoring others and promoting operational excellence across teams
- Strong problem-solving mindset with a focus on continuous improvement and automation
What We Offer
- Opportunity to work on mission-driven technology with meaningful real-world impact
- Collaborative engineering culture focused on innovation, reliability, and continuous learning
- Flexible environment that supports work-life balance while maintaining operational excellence