What are the responsibilities and job description for the Senior Site Reliability Engineer position at Duetto?
- About the Company
Duetto is building the future of hotel revenue strategy. We're not just another SaaS company — we're redefining what's possible for hotels through our category-creating platform, the Revenue & Profit Operating System.
- Role Summary / Purpose
Our technology stack is built on AWS and primarily consists of:
- Java
- Python
- NoSQL
- Single-page JavaScript web techniques (jQuery, Backbone, React, and RequireJS)
- Patent-pending analytical methods on top of MongoDB
- Postgres
- Terraform/Terragrunt and Chef for IaC
- DataDog and Prometheus
- GitHub for source control
- GitHub Actions and Jenkins for CI/CD
- Key Responsibilities
- Architect and implement infrastructure solutions to facilitate seamless migration of critical systems while ensuring uptime, reliability, and a high-quality experience for end users.
- Design, develop, test, and maintain tools and processes to efficiently manage and operate SaaS products hosted on AWS, with a focus on scalability and automation.
- Partner with developers to enhance the reliability, performance, scalability, and security of server and application architectures.
- Build and maintain critical components of our infrastructure, emphasizing robustness, security, and high availability to meet demanding service-level expectations.
- Foster strong cross-team collaboration by driving engagement, promoting shared goals, and ensuring alignment across technical and non-technical teams.
- Lead efforts to ensure systems are secure by default, addressing vulnerabilities proactively and implementing best practices for cybersecurity preparedness.
- Be willing to learn and adopt AI in DevOps/SRE workflows.
- Be the last line of support for services that thousands of customers (hotels, resorts, casinos, etc.) around the world depend on 24/7.
- Troubleshoot on-call incidents to ensure rapid resolution and minimal service disruption. Participate in detailed Root Cause Analysis (RCA) to identify underlying issues and work cross-functionally to implement preventative measures and long-term solutions, ensuring similar problems are avoided in the future.
- Qualifications
- 5 years of experience in an Ops, DevOps or SRE role.
- Experience in System Design and Architecture.
- Engineer-level experience with networking and security concepts.
- Understanding of the fundamentals behind load-balancing technologies. Experience configuring Layer 7 load balancing is a plus.
- Experience collaborating with engineers on architecture decisions.
- Experience administering cloud computing services such as AWS (preferred), Azure, or GCP, including working knowledge of permissions structures, multi-account management structures, and single sign-on (SSO).
- Experience with AWS ecosystem tools such as AWS IAM, VPC, EC2, ELB, RDS, S3, Lambda, API Gateway, Secrets Manager, KMS, CloudWatch, CloudTrail.
- Experience with security compliance certifications such as SOC2.
- Experience working in an environment with a heavy emphasis on a DevOps and service-reliability mindset.
- Experience provisioning, configuring, administering, and using enterprise monitoring ecosystems like Prometheus, Grafana, DataDog or similar.
- Experience with CI/CD Tools such as GitHub, GitHub Actions, JFrog Artifactory, Jenkins, and GitOps methodologies.
- Experience writing and using infrastructure as code with Terraform.
- Experience with configuration-management toolsets such as Chef or Puppet.
- Experience with containers and container orchestration tools such as ECS/EKS (a plus).
- Experience managing infrastructure and contributing as part of a multi-user infrastructure team, using Terraform and associated toolsets. Relevant SOC2 experience is also a plus.
- Fluency in reading Java, Ruby, Bash/Zsh, HCL, Python, and JavaScript.
- Strong experience in troubleshooting and resolving complex on-call incidents with a focus on minimizing service disruption and downtime.
- Proven ability to lead and participate in detailed Root Cause Analysis (RCA) processes to identify and address underlying issues effectively.
- Demonstrated expertise in implementing preventative measures and long-term solutions based on RCA findings to ensure recurring issues are mitigated.
- Experience constructing and maintaining build/deploy automation tooling.
- Willingness to participate in a weekly on-call rotation.
- Ability to work both independently and within a team environment.
- A passion for technology and a drive to stay up to date with new tools and best practices.
- Team Player - Works well with others, is highly collaborative, and acts as a strong partner to other team members and functions.
- Execution - Desire to work on a fast-paced team and help set direction and architecture.
- Creativity - Thrives in an environment without a set playbook.
- Quality - Takes pride in delivering robust, high-quality implementations.
- Ownership - Enjoys owning and driving projects.