What are the responsibilities and job description for the Platform Site Reliability Engineer position at Jobs via Dice?
Dice is the leading career destination for tech experts at every stage of their careers. Our client, MW Partners LLC, is seeking the following. Apply via Dice today!
Responsibilities and duties:
- Ensure the reliability, availability, and operational health of a portfolio of SaaS-based solutions, including vendor-managed services and in-house (customer-zero) platforms.
- Participate in on-call rotations and incident response, leading investigation, mitigation, coordination, and post-incident follow-up.
- Establish and maintain effective observability for systems that are not fully owned, identifying practical ways to obtain actionable metrics, logs, and signals from vendor and partner solutions.
- Use operational data and incident learnings to identify reliability risks and drive targeted improvements that reduce customer impact.
- Apply appropriate change controls at owned or influenced layers of the stack, balancing reliability, velocity, and business needs.
- Partner with internal teams and external vendors to communicate expectations, coordinate response and remediation, and influence reliability outcomes.
- Produce clear incident communications and post-incident analyses that inform stakeholders and drive lasting improvements.
- Leverage automation and AI-assisted tooling to improve detection, triage, and operational efficiency.
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field, or equivalent practical experience.
- 3-5 years of experience in Site Reliability Engineering, production operations, or a closely related role.
- Strong foundation in Site Reliability Engineering practices, including observability, incident response, and reliability measurement.
- Hands-on experience operating SaaS or third-party systems where full-stack ownership is limited.
- Deep understanding of monitoring, logging, and alerting, with the ability to design signals that are actionable rather than noisy.
- Proven incident response experience, including on-call participation and cross-team coordination during high-impact events.
- Ability to think creatively and pragmatically when instrumenting and improving systems with constrained control.
- Excellent written and verbal communication skills, especially in high-pressure incident and vendor-coordination scenarios.
- Experience working across organizational and vendor boundaries to resolve complex operational issues.
- Sound engineering judgment when assessing risk, prioritizing work, and making reliability tradeoffs in production environments.
- Experience supporting production systems with on-call responsibilities and incident response expectations.
- Strong experience working with observability data (metrics, logs, alerts) to diagnose issues and drive improvements.
- Comfort using automation and AI-assisted tools as part of everyday operational workflows.
- Experience supporting enterprise-scale SaaS platforms or shared services.
- Prior experience working directly with vendors to resolve reliability or operational issues.
- Familiarity with cloud-based and distributed system architectures.