What are the responsibilities and job description for the Site Reliability Engineer, Consultant position at Blue Shield of CA?
Your Role
We are seeking an experienced Site Reliability Engineer (SRE) to lead reliability, scalability, and performance initiatives across our production systems. In this role, you will blend software engineering, automation, and systems operations to keep our platforms resilient, efficient, and continuously improving.
You will be part of a cross-functional team responsible for designing, implementing, and maintaining reliable systems that support millions of requests daily. This position requires a deep understanding of distributed systems, cloud infrastructure, automation, and incident response.
Your Knowledge and Experience
Education & Experience
- Requires a Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience); Master's degree a plus.
- 7 years of experience in building, supporting, and improving production systems and infrastructure.
Cloud Platforms
- Minimum 5 years of hands-on experience with Azure, AWS, or GCP.
- Demonstrated expertise in virtual machines (VMs), containers, cloud networking, identity and access management (IAM), monitoring, storage, and serverless functions.
- Comfortable deploying and managing cloud-native services and infrastructure.
Programming & Scripting
- Proficiency in one or more languages such as Python, Go, Java, Bash, PowerShell, or similar.
- Ability to write clean, maintainable code for automation and tooling.
Containerization & Orchestration
- Experience working with Kubernetes, Docker, and tools like Helm or Red Hat OpenShift.
- Familiarity with managing containerized applications in production environments.
Monitoring & Observability
- Working knowledge of tools such as Prometheus, Grafana, Datadog, New Relic, ELK Stack, Dynatrace, Splunk, BigPanda, and SolarWinds.
- Ability to set up dashboards, alerts, and metrics to ensure system health and performance.
CI/CD & Configuration Management
- Experience with CI/CD pipelines using tools like Jenkins, GitHub Actions, GitLab CI, Argo CD, or Spinnaker.
- Familiarity with configuration management tools such as Ansible, Chef, or Puppet.
Automation & Emerging Technologies
- Understanding of agentic AI systems and automation frameworks for incident response and infrastructure optimization is a plus.
- Interest in exploring intelligent automation to improve reliability and reduce manual toil.
Testing & Deployment Expertise
- Experience with chaos engineering tools (e.g., Gremlin, Chaos Monkey) and methodologies.
- Hands-on knowledge of Blue/Green and Canary deployment strategies in cloud-native environments.
External hires must pass a background check/drug screen. Qualified applicants with arrest records and/or conviction records will be considered for employment in a manner consistent with Federal, State and local laws, including but not limited to the San Francisco Fair Chance Ordinance. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, protected veteran status, disability status, or any other classification protected by Federal, State and local laws.