What are the responsibilities and job description for the Cloud Operations/Platform Manager position at ALTEN?
Job Description - Manager, Cloud/Platform Operations
Location: Atlanta / Roswell, GA (Onsite with Offshore Team Management)
Role Summary
Rheem is seeking a Manager, Cloud Operations to lead, transform, and scale its digital operations landscape across CloudOps, SRE, NOC, Observability, AIOps, and MLOps. This individual will serve as the single point of accountability for operational stability and innovation, managing offshore teams while working closely with Rheem's U.S. digital leadership.
This role is not a steady-state manager position. The successful candidate will:
- Identify operational gaps.
- Suggest and implement best practices and tools.
- Introduce automation and innovation strategies.
- Guide daily deliverables for offshore teams.
- Demonstrate tangible business impact each quarter (improved uptime, reduced MTTR, cost savings, predictive alerting, etc.).
Key Responsibilities
Operations Strategy & Governance
- Define the vision, strategy, and roadmap for CloudOps, Reliability, and Operational Excellence.
- Establish KPIs and OKRs aligned with Rheem's business goals (availability, MTTR, cloud cost per device, customer churn reduction).
- Deliver quarterly impact reports to business leadership showcasing operational improvements and ROI.
- Own multi-region cloud operations across AWS and Azure platforms.
- Drive cost transparency and optimization via FinOps practices and dashboards.
- Build capacity and resiliency models for predictable operations.
- Conduct resiliency drills and game days to ensure high availability and compliance.
- Establish SLIs, SLOs, and error budgets to measure reliability.
- Build incident management playbooks and drive blameless postmortems.
- Proactively improve reliability through automation, self-healing, and continuous testing.
- Transform NOC from alert-driven to predictive, AIOps-enabled operations.
- Consolidate monitoring tools and reduce alert fatigue with intelligent correlation.
- Ensure 24x7 support coverage through offshore team alignment and escalation management.
- Build a unified observability stack (logs, metrics, traces, RUM) leveraging OpenTelemetry.
- Enable business-oriented dashboards (device uptime, customer adoption, churn trends).
- Ensure end-to-end visibility from connected devices → cloud microservices → customer-facing apps.
- Deploy AIOps solutions for anomaly detection, predictive alerts, event correlation, and automated remediation.
- Operationalize ML models: rollout, monitoring, drift detection, rollback strategies.
- Showcase measurable value, e.g., warranty claim reduction, improved customer experience metrics.
- Audit current toolchain and processes; identify redundancies, gaps, and opportunities for automation.
- Align with DevOps/SecOps to streamline release-to-operations handshakes.
- Drive Infrastructure-as-Code for operations (Terraform, Ansible, GitOps).
- Manage and mentor a distributed team (offshore and onsite), setting clear goals and accountability.
- Define roles, responsibilities, and shift structures for 24x7 global coverage.
- Build a culture of continuous improvement and operational excellence.
- Ensure Rheem operations align with compliance standards (SOC2, ISO, HIPAA where applicable).
- Own business continuity planning and disaster recovery testing.
- Proactively identify operational risks and mitigate before they impact business.
- Act as the voice of operations at business leadership tables.
- Translate technical improvements into business outcomes (lower churn, improved uptime, faster installs, fewer complaints).
- Champion a quarterly innovation agenda to showcase improvements in uptime, cost, and reliability.
Must-Have
- Experience & Leadership
- 10+ years of experience in Cloud Operations, Site Reliability Engineering, or Digital Operations.
- Proven track record of owning operational outcomes (uptime, MTTR, cost optimization, observability).
- Experience managing offshore/global delivery teams with 24x7 coverage.
- Strong leadership presence — able to act as a change agent, operate autonomously, and deliver measurable outcomes without day-to-day direction.
- Cloud & Technical Expertise
- Hands-on experience with AWS and/or Azure (multi-account, multi-region operations).
- Solid expertise with observability & monitoring tools (Datadog, Dynatrace, Splunk, Grafana, Prometheus, ELK/EFK).
- Familiarity with Infrastructure-as-Code (Terraform, Ansible, GitOps).
- Strong understanding of SRE principles (SLIs, SLOs, error budgets, incident management frameworks).
- Process & Governance
- Demonstrated ability to design and implement operations frameworks (Ops playbooks, NOC modernization, incident command systems).
- Knowledge of FinOps practices (cloud cost visibility, optimization, showback/chargeback).
- Experience ensuring compliance with SOC2, ISO, HIPAA or equivalent standards.
- Soft Skills
- Excellent stakeholder communication skills — ability to link operational KPIs with business outcomes.
- Strong team leadership and mentoring skills, especially across distributed teams.
Nice-to-Have
- Exposure to AIOps platforms (Moogsoft, BigPanda, OpsRamp, ServiceNow AI modules).
- Experience with MLOps tooling (MLflow, Kubeflow, SageMaker, Azure ML) for model deployment and monitoring.
- Prior background in platform operations at a product/SaaS company (vs pure IT Ops).
- Experience leading automation-first initiatives (predictive alerts, self-healing infra, auto-remediation pipelines).
- Hands-on experience with CI/CD → Ops handshakes and change-impact assessments.
- Cloud certifications:
  - AWS Certified Solutions Architect / DevOps Engineer
  - Microsoft Certified: Azure Administrator / Solutions Architect
  - FinOps Certified Practitioner