What are the responsibilities and job description for the Platform Site Reliability Engineer position at MW Partners LLC?
Responsibilities and duties:
- Ensure the reliability, availability, and operational health of a portfolio of SaaS-based solutions, including vendor-managed services and in-house (customer-zero) platforms.
- Participate in on-call rotations and incident response, leading investigation, mitigation, coordination, and post-incident follow-up.
- Establish and maintain effective observability for systems that are not fully owned, identifying practical ways to obtain actionable metrics, logs, and signals from vendor and partner solutions.
- Use operational data and incident learnings to identify reliability risks and drive targeted improvements that reduce customer impact.
- Apply appropriate change controls at owned or influenced layers of the stack, balancing reliability, velocity, and business needs.
- Partner with internal teams and external vendors to communicate expectations, coordinate response and remediation, and influence reliability outcomes.
- Produce clear incident communications and post-incident analyses that inform stakeholders and drive lasting improvements.
- Leverage automation and AI-assisted tooling to improve detection, triage, and operational efficiency.
Requirements:
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field, or equivalent practical experience.
- 3-5 years of experience in Site Reliability Engineering, production operations, or a closely related role.
- Strong foundation in Site Reliability Engineering practices, including observability, incident response, and reliability measurement.
- Hands-on experience operating SaaS or third-party systems where full-stack ownership is limited.
- Deep understanding of monitoring, logging, and alerting, with the ability to design signals that are actionable rather than noisy.
- Proven incident response experience, including on-call participation and cross-team coordination during high-impact events.
- Ability to think creatively and pragmatically when instrumenting and improving systems with constrained control.
- Excellent written and verbal communication skills, especially in high-pressure incident and vendor-coordination scenarios.
- Experience working across organizational and vendor boundaries to resolve complex operational issues.
- Sound engineering judgment when assessing risk, prioritizing work, and making reliability trade-offs in production environments.
- Experience supporting production systems with on-call responsibilities and incident response expectations.
- Strong experience working with observability data (metrics, logs, alerts) to diagnose issues and drive improvements.
- Comfort using automation and AI-assisted tools as part of everyday operational workflows.
Preferred Skills:
- Experience supporting enterprise-scale SaaS platforms or shared services.
- Prior experience working directly with vendors to resolve reliability or operational issues.
- Familiarity with cloud-based and distributed system architectures.
Salary: $45 - $56