What are the responsibilities and job description for the Site Reliability Engineer position at Optomi?

Site Reliability Engineer (Hybrid – Orlando, FL | Burbank, CA | Seattle, WA)

Optomi, in partnership with a leading entertainment organization, is seeking a Site Reliability Engineer to join their Generative AI Engineering team. This engineer will play a critical role in shaping the reliability, scalability, and operational excellence of cloud infrastructure supporting enterprise AI and conversational experience platforms. The ideal candidate will bring deep expertise in cloud infrastructure, Kubernetes, Terraform, and DevOps practices while serving as a technical leader across complex multi-cloud environments. This is a highly visible role responsible for ensuring platform stability, driving automation initiatives, and supporting mission-critical AI applications across GCP, AWS, and Azure. This position is open to candidates located in Orlando, FL; Burbank, CA; or Seattle, WA, and follows a hybrid schedule requiring onsite presence four days per week with remote work on Fridays.

Location: Orlando, FL | Burbank, CA | Seattle, WA (Hybrid – Onsite Monday through Thursday, Remote Friday)

What the right candidate will enjoy:

Leading and mentoring a team of Site Reliability Engineers and DevOps specialists
Working with cutting-edge technologies in multi-cloud environments
Playing a pivotal role in the reliability strategy of generative AI platforms

What type of experience the right candidate has:

7 years of Site Reliability Engineering, DevOps, Platform Engineering, or Infrastructure Engineering experience
Expert-level experience with Kubernetes administration, operations, cluster scaling, and Helm-based configuration management
Advanced experience building and managing Infrastructure-as-Code using Terraform
Strong experience implementing and supporting CI/CD pipelines using Harness or similar deployment orchestration platforms
Hands-on experience managing cloud infrastructure across GCP, AWS, and Azure, with strong expertise in GCP environments
Strong scripting and automation experience using Python, Bash, and YAML
Experience supporting backend infrastructure technologies including PostgreSQL, Redis, Kafka, MongoDB, and HashiCorp Vault
Deep understanding of observability, monitoring, alerting, logging, and system reliability best practices
Strong troubleshooting, root cause analysis, and incident response experience in complex production environments
Experience implementing security, identity management, compliance controls, and cloud governance standards
Demonstrated leadership experience within Agile/Scrum environments
Excellent written communication and cross-functional collaboration skills

What the right candidate's responsibilities are:

Architect, design, and maintain highly available cloud infrastructure supporting Generative AI and conversational experience platforms
Develop and maintain Kubernetes environments, Helm charts, and Terraform modules to support scalable platform operations
Design and implement automated deployment processes and CI/CD pipelines utilizing Harness and Infrastructure-as-Code principles
Manage and support multi-cloud infrastructure environments across GCP, AWS, and Azure
Ensure platform reliability and availability while maintaining aggressive uptime and service-level objectives
Implement observability, monitoring, alerting, logging, and tracing solutions across infrastructure and application environments
Support and maintain critical backend services including Kafka, PostgreSQL, Redis, MongoDB, Vault, and other platform dependencies
Lead incident response efforts, root cause investigations, and post-incident remediation activities
Establish operational processes including capacity planning, patch management, backups, disaster recovery, and infrastructure lifecycle management
Collaborate with architects, engineering leadership, and development teams to drive platform improvements and reliability initiatives
Evaluate and implement automation opportunities, including AI-driven operational tooling and proactive monitoring solutions
Mentor engineers and serve as a technical leader for infrastructure, reliability, and operational best practices

Preferred Qualifications:

Previous experience leading large-scale cloud infrastructure or platform engineering initiatives
Experience supporting AI, machine learning, or Generative AI platforms
Experience with open-source infrastructure and orchestration technologies such as Apache Airflow
Exposure to AI-assisted operations, predictive monitoring, or AIOps tooling
Experience operating within hybrid cloud environments and enterprise-scale deployment ecosystems
