
Senior Site Reliability Engineer

Team TAG Services, LLC
Chicago, IL Full Time
POSTED ON 12/17/2025
AVAILABLE BEFORE 1/16/2026
The Aspen Group (TAG) is one of the largest and most trusted retail healthcare business support organizations in the U.S. and has supported over 20,000 healthcare professionals and team members with close to 1,500 health and wellness offices across 48 states in four distinct categories: dental care, urgent care, medical aesthetics, and animal health. Working in partnership with independent practice owners and clinicians, the team is united by a single purpose: to prove that healthcare can be better and smarter for everyone. TAG provides a comprehensive suite of centralized business support services that power the impact of five consumer-facing businesses: Aspen Dental, ClearChoice Dental Implant Centers, WellNow Urgent Care, Chapter Aesthetic Studio, and Lovet Pet Health Care. Each brand has access to a deep community of experts, tools, and resources to grow their practices, and an unwavering commitment to delivering high-quality consumer healthcare experiences at scale.

As a Senior Site Reliability Engineer (SRE) at TAG – The Aspen Group, you will be responsible for ensuring the reliability, performance, and scalability of our core systems. This role involves proactively building and managing monitoring solutions, leading incident response, and continuously optimizing system performance to exceed business objectives. We are actively integrating AI and machine learning into our operational workflows, and you will be on the front lines, leveraging intelligent automation and machine learning to build a proactive, resilient infrastructure. This is an opportunity to go beyond SRE by applying cutting-edge technology to solve complex reliability challenges.

Responsibilities:

Intelligent Site Reliability Engineering:
  • Design and build highly scalable and resilient systems to support our applications and services, incorporating predictive analytics to anticipate reliability risks.
  • Develop and manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs) using machine learning anomaly detection to ensure systems meet reliability targets.
  • Drive improvements in system reliability, availability, and performance through proactive measures, automation, and intelligent failure prediction.

Advanced Observability:
  • Implement and manage comprehensive monitoring and alerting solutions, integrating with intelligent observability platforms that reduce alert noise and correlate events.
  • Develop and maintain dashboards and reporting tools that provide data-driven insights for actionable troubleshooting recommendations and performance optimization.
  • Evaluate and integrate advanced monitoring tools and operational intelligence platforms to enhance observability and root cause identification.

Proactive Incident Management:
  • Lead and participate in incident response efforts, using intelligent log analysis and automated event correlation to speed up troubleshooting and root cause identification.
  • Develop and maintain incident management processes incorporating automated decision support systems to improve response times and minimize service disruptions.
  • Conduct post-incident reviews, using automated pattern recognition and trend analysis to identify systemic issues and implement preventive measures.

Performance and Capacity Optimization:
  • Analyze performance metrics and logs, supported by advanced observability tools, to detect bottlenecks and inefficiencies.
  • Collaborate with development teams to implement automated profiling and optimization recommendations for code and infrastructure improvements.
  • Perform capacity planning using machine learning forecasting models to ensure systems can handle current and future loads.

Automation and Process Improvement:
  • Develop and implement automation solutions, including intelligent runbook automation, self-healing systems, and automated incident triage.
  • Identify and drive process improvements by applying machine learning to operational data for continuous optimization.
  • Maintain documentation that includes automation and machine learning guidelines for monitoring, incident management, and SRE best practices.

Collaboration and Communication:
  • Work closely with engineering, operations, and product teams to align reliability and monitoring goals, including automation adoption strategies.
  • Communicate effectively with stakeholders, providing regular updates on system health, incidents, performance improvements, and data-driven insights.
  • Foster a culture of collaboration, knowledge sharing, and automation best practices within the team and across the organization.

Requirements:
  • Bachelor's degree in computer science or a related technical field.
  • At least 5 years of experience in Site Reliability Engineering or a similar role.
  • Strong proficiency in at least one programming language such as Python, Go, or C#.
  • Demonstrated experience applying machine learning and automation to operational workflows such as monitoring, alerting, and incident response.
  • Expertise with infrastructure-as-code tools such as Terraform.
  • Proven experience working with and monitoring container environments such as Cloud Run and Kubernetes.
  • Hands-on experience working within an Azure, AWS, or GCP environment (GCP preferred).
  • Strong understanding of networking, distributed systems, and cloud infrastructure.
  • Familiarity with intelligent monitoring platforms and operational analytics tools such as Prometheus, Grafana, OpenSearch, Sentry, and Google Cloud Observability.
  • Excellent problem-solving skills and the ability to work independently and as part of a team.
  • Experience with incident management, root cause analysis, and automated operational workflows.

Compensation and benefits:
  • Annual pay range: $129,000-$160,000
  • A generous benefits package that includes paid time off, health, dental, vision, and a 401(k) savings plan with match

The Aspen Group (TAG) is one of the largest and most trusted retail healthcare business support organizations in the U.S. and has supported over 16,000 healthcare professionals and team members at more than 1,200 health and wellness offices across 46 states in four distinct categories: dental care, urgent care, medical aesthetics, and animal health. Working in partnership with independent practice owners and clinicians, the team is united by a single purpose: to prove that healthcare can be better and smarter for everyone. TAG provides a comprehensive suite of centralized business support services that power the impact of five consumer-facing businesses: Aspen Dental, ClearChoice Dental Implant Centers, WellNow Urgent Care, Chapter Aesthetic Studio, and AZPetVet. Each brand has access to a deep community of experts, tools, and resources to grow their practices, and an unwavering commitment to delivering high-quality consumer healthcare experiences at scale.

Salary: $129,000 - $160,000





