Site Reliability Engineer (Hybrid – Orlando, FL | Burbank, CA | Seattle, WA)
Optomi, in partnership with a leading entertainment organization, is seeking a Site Reliability Engineer to join their Generative AI Engineering team. This engineer will play a critical role in shaping the reliability, scalability, and operational excellence of cloud infrastructure supporting enterprise AI and conversational experience platforms. The ideal candidate will bring deep expertise in cloud infrastructure, Kubernetes, Terraform, and DevOps practices while serving as a technical leader across complex multi-cloud environments. This is a highly visible role responsible for ensuring platform stability, driving automation initiatives, and supporting mission-critical AI applications across GCP, AWS, and Azure. This position is open to candidates located in Orlando, FL, Burbank, CA, or Seattle, WA, and follows a hybrid schedule requiring onsite presence four days per week with remote work on Fridays.
Location: Orlando, FL | Burbank, CA | Seattle, WA (Hybrid – Onsite Monday through Thursday, Remote Friday)
What the right candidate will enjoy:
- Leading and mentoring a team of Site Reliability Engineers and DevOps specialists
- Working with cutting-edge technologies in multi-cloud environments
- Playing a pivotal role in the reliability strategy of generative AI platforms
What type of experience the right candidate has:
- 7 years of Site Reliability Engineering, DevOps, Platform Engineering, or Infrastructure Engineering experience
- Expert-level experience with Kubernetes administration, operations, cluster scaling, and Helm-based configuration management
- Advanced experience building and managing Infrastructure-as-Code using Terraform
- Strong experience implementing and supporting CI/CD pipelines using Harness or similar deployment orchestration platforms
- Hands-on experience managing cloud infrastructure across GCP, AWS, and Azure, with strong expertise in GCP environments
- Strong scripting and automation experience using Python and Bash, along with YAML-based configuration
- Experience supporting backend infrastructure technologies including PostgreSQL, Redis, Kafka, MongoDB, and HashiCorp Vault
- Deep understanding of observability, monitoring, alerting, logging, and system reliability best practices
- Strong troubleshooting, root cause analysis, and incident response experience in complex production environments
- Experience implementing security, identity management, compliance controls, and cloud governance standards
- Demonstrated leadership experience within Agile/Scrum environments
- Excellent written communication and cross-functional collaboration skills
What the right candidate's responsibilities are:
- Architect, design, and maintain highly available cloud infrastructure supporting Generative AI and conversational experience platforms
- Develop and maintain Kubernetes environments, Helm charts, and Terraform modules to support scalable platform operations
- Design and implement automated deployment processes and CI/CD pipelines utilizing Harness and Infrastructure-as-Code principles
- Manage and support multi-cloud infrastructure environments across GCP, AWS, and Azure
- Ensure platform reliability and availability while meeting aggressive uptime targets and service-level objectives
- Implement observability, monitoring, alerting, logging, and tracing solutions across infrastructure and application environments
- Support and maintain critical backend services including Kafka, PostgreSQL, Redis, MongoDB, Vault, and other platform dependencies
- Lead incident response efforts, root cause investigations, and post-incident remediation activities
- Establish operational processes including capacity planning, patch management, backups, disaster recovery, and infrastructure lifecycle management
- Collaborate with architects, engineering leadership, and development teams to drive platform improvements and reliability initiatives
- Evaluate and implement automation opportunities, including AI-driven operational tooling and proactive monitoring solutions
- Mentor engineers and serve as a technical leader for infrastructure, reliability, and operational best practices
Preferred Qualifications:
- Previous experience leading large-scale cloud infrastructure or platform engineering initiatives
- Experience supporting AI, machine learning, or Generative AI platforms
- Experience with open-source infrastructure and orchestration technologies such as Apache Airflow
- Exposure to AI-assisted operations, predictive monitoring, or AIOps tooling
- Experience operating within hybrid cloud environments and enterprise-scale deployment ecosystems