What are the responsibilities and job description for the Senior DevOps / Site Reliability Engineer position at CriticalRiver Inc.?
About the Role:
We’re looking for an experienced Senior DevOps / Site Reliability Engineer to design and build the cloud and reliability foundation for a new multi-tenant SaaS platform while supporting our existing products. This is a foundational early hire with high impact: you’ll define the AWS architecture, establish DevOps and SRE best practices, and ensure 99.9% uptime as the platform scales. You’ll work closely with the Platform, Backend, Frontend, and AI teams to enable fast, secure deployments and production-grade reliability.
What You’ll Do:
- Architect and manage AWS infrastructure (EKS, RDS, VPC, IAM, S3)
- Build and maintain Terraform-based Infrastructure as Code
- Own Kubernetes/EKS clusters, including scaling, upgrades, and deployments
- Design and optimize CI/CD pipelines (GitHub Actions/Jenkins, GitOps)
- Implement monitoring, alerting, and observability (Datadog, CloudWatch)
- Lead incident response, on-call processes, and postmortems
- Define and track SLOs/SLIs and error budgets
- Implement security and compliance controls (SOC 2, IAM, encryption)
Required Qualifications:
- 7–10 years of DevOps / SRE experience in production environments
- Deep expertise in AWS and Kubernetes (EKS)
- Strong experience with Terraform or CloudFormation
- Proven ownership of CI/CD, monitoring, and incident management
- Experience supporting multi-tenant B2B SaaS platforms
- Strong scripting skills (Python or Bash)
- Security-first mindset with hands-on compliance exposure