
Cloud Infrastructure – Site Reliability Engineer (SRE)

Alibaba Cloud
Sunnyvale, CA Full Time
POSTED ON 10/31/2025
AVAILABLE BEFORE 11/29/2025

The Alibaba Cloud Native Message Middleware Team is responsible for messaging products, including RocketMQ. We are committed to building a more stable, user-friendly, streaming-capable, and large-scale messaging platform for the future.


Cloud Product Operations & Reliability


Oversee stability maintenance, performance tuning, and high-availability architecture design for cloud middleware, including messaging middleware (Kafka/RocketMQ).


Manage the containerized middleware lifecycle on Kubernetes clusters: implement deployments, auto-scaling, version upgrades, and resource optimization in K8s environments.


Incident Response & Root Cause Analysis


Lead the troubleshooting of middleware-related incidents (e.g., message backlog, service registration failures) through log analysis, distributed tracing, and monitoring systems.


Develop diagnostic tools using Java/Go to resolve production issues, performance bottlenecks, and compatibility challenges.


Automation & Operational Excellence


Build Python/Go/Shell automation tools to standardize middleware deployment, monitoring, and disaster recovery workflows.


Implement chaos engineering experiments, capacity planning strategies, and failover mechanisms to enhance system resilience.


Strong scripting skills in Shell/Python and experience with Infrastructure as Code (IaC) tools (Terraform preferred).


Minimum Qualifications:

Experience: More than 2 years in distributed systems reliability engineering, familiarity with high-availability architecture design, and proficiency in at least one of Python, Go, or Java.

Messaging: Experience with cluster management, message reliability assurance, and performance optimization for Kafka/RocketMQ.

Hands-on experience deploying middleware on Kubernetes (Helm/Operator preferred).

Automation: Ability to convert operations experience into automated solutions, and familiarity with messaging middleware such as Kafka and RocketMQ.


Preferred Qualifications:

SRE Practices: Familiar with core SRE practices (incident review, error budgeting, chaos engineering) and experienced in building automated risk control systems.




The pay range for this position at commencement of employment is expected to be between $104,400 and $171,000/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.


If hired, employee will be in an “at-will position” and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.

Salary: $104,400 - $171,000




