What are the responsibilities and job description for the Site Reliability Engineer position at Prosum?
We're looking for an SRE-minded engineer who is passionate about observability, pipeline health, and promoting reliability best practices across teams. You'll work closely with the platform team to define and monitor service-level objectives (SLOs), build out golden dashboards, and help other teams understand and implement effective monitoring, alerting, and tracing practices using tools like Datadog.
This role is perfect for someone who combines deep technical knowledge in infrastructure monitoring with strong communication skills—someone who can not only build and maintain reliability tools but also evangelize SRE practices and guide teams on how to leverage them.
What You’ll Do:
- Define and implement SLOs and help teams align around them.
- Create and maintain golden dashboards to monitor system and pipeline health.
- Promote SRE best practices across engineering teams that may not have SRE experience.
- Help analyze incidents and close the feedback loop to improve reliability.
- Work hands-on with observability tools (e.g., Datadog) for alerting, monitoring, and tracing.
- Act as a go-to resource and coach for teams wanting to improve their systems’ reliability and observability.
What We’re Looking For:
- Strong experience with modern observability tools (Datadog preferred).
- Solid understanding of infrastructure monitoring, alerting, and tracing.
- Experience defining and operationalizing SLOs.
- A people-oriented mindset with the ability to communicate SRE principles effectively.
- Comfort with enabling other teams and sharing knowledge, not just building in a silo.
Salary: $70 - $80