What are the responsibilities and job description for the Manager, Site Reliability Engineering and Incident Management position at Planet DDS?
Planet DDS is a leading provider of cloud-based solutions that empower growth-minded dental businesses. Now serving over 13,000 practices and 118,000 customers in North America, Planet DDS delivers a comprehensive suite of solutions, including Denticon Practice Management, Cloud 9 Ortho Practice Management, and Apteryx Cloud Imaging. Planet DDS is dedicated to enabling dental support organizations (DSOs) and groups to grow and thrive with technology that delivers seamless integrations, improved workflows, and future-proof scalability.
We are seeking a Manager, Site Reliability Engineering and Incident Management, to manage our Site Reliability Engineering function as well as our external incident response function for our production operations. To be successful, the manager will need to be self-motivated, communicate clearly, and operate with a sense of urgency in a fast-paced environment. Providing operational support means that you will apply your customer empathy to production incidents and to any other internal engineering-related support requests. It will be crucial for you to gain a deep understanding of our systems and architecture and build hands-on knowledge of support and observability tooling. You will need to be available to engage in incident escalations 24x7. You will need to seek answers from subject matter experts in a variety of positions, from architects and support staff to business leaders and technically minded developers.
- Location: East Coast (US)
- Team Leadership & Development
  - Lead and mentor a team of SREs and Incident Managers.
  - Foster a culture of reliability, accountability, and continuous improvement.
  - Collaborate with engineering teams to design resilient platform architectures.
- Incident Management
  - Oversee the incident response process for outages and service disruptions.
  - Ensure timely detection, escalation, and resolution of incidents.
  - Drive post-incident reviews (PIRs) and root cause analysis.
  - Implement improvements based on lessons learned to prevent recurrence.
- Operational Excellence
  - Mature and enforce best practices for incident response and runbooks.
  - Automate operational tasks to reduce toil and improve efficiency.
  - Maintain observability tools (monitoring, alerting, logging).
- Process & Governance
  - Define and maintain incident management policies and escalation procedures.
  - Drive initiatives for chaos engineering, capacity planning, and disaster recovery testing.
- 7 years in SRE, DevOps, or Infrastructure roles.
- 3 years in Incident Management leadership.
- Deep understanding of reliability, scalability, and performance optimization.
- Multi-cloud expertise in AWS, Azure, or GCP.
- Understanding of DNS, load balancing, firewalls, and compliance frameworks.
- Security is part of everything we do, and this role requires knowledge of fundamental cloud security (e.g., identity and access management, firewalls).
- Deep understanding of logging, monitoring, and security best practices
- Strong collaboration and communication skills
- Bachelor’s Degree in a relevant major or equivalent years of experience is a plus
- Dental industry knowledge
- Experience working in B2B SaaS companies
- Experience with cloud containers, specifically Kubernetes
Why are we here?
Dental software is broken. We aim to fix it.
Where are we headed?
To be the first choice for growth-minded dental businesses.
How do we get there?
To encourage measurable progress toward our vision and make the best decisions on behalf of employees and customers, we adopted a set of common values:
- Collaborative – Working independently and across teams, we create scalable solutions to enable company growth
- Empathetic – We are educated on the experience of our customers and feel vested in their success
- Accountable – We feel ownership for the quality of our work and take pride in the positive outcomes
- Trustworthy – We operate with integrity and honesty, making promises we know we can keep
- Ambitious – We are driven by our ability to make a long-term, positive impact on the lives of dental market leaders