
Site Reliability Engineer

Alibaba Cloud
Bellevue, WA Full Time
POSTED ON 12/23/2025
AVAILABLE BEFORE 1/29/2026

We are the Apsara Lab at Alibaba Cloud Intelligence Group, committed to delivering a cutting-edge MaaS platform and toolkits for application development through technological innovation and engineering practices. Our team focuses on fundamental R&D in model services, while also providing full-stack development that ranges from architecture design to model applications. Our goal is to build the industry's largest model service platform with excellent cost-efficiency, high performance, and enterprise-level reliability. By doing so, we aim to empower numerous enterprise clients to accelerate the development of model applications.


We are seeking a passionate and technically skilled Site Reliability Engineer (SRE) to join our team. You will play a critical role in building and maintaining a highly available, high-performance model service platform. Your responsibilities will include optimizing monitoring and alerting, responding to incidents, troubleshooting customer issues, and developing automation systems to ensure the stability and reliability of the Alibaba Cloud platform and system applications.


Key Responsibilities

1. Oversee the deployment, operation, maintenance, and continuous improvement of the standalone website and platform, including its initial construction and subsequent operational changes.

2. System Reliability

* Oversee the monitoring and alerting of our platform and system applications, rapidly diagnosing and resolving network, service, and hardware-level failures to meet SLA targets.

* Design and optimize monitoring metrics, log collection, and alerting strategies to enhance system observability.

* Participate in the emergency response and handling of online incidents, conduct root cause analysis (RCA), and drive long-term solutions to prevent recurrence.

3. Customer Issue Resolution

* Investigate and resolve customer-reported issues related to the QoS of API services (e.g., latency, performance, optimization), collaborating with development teams to identify flaws in application clusters, edge networks, or infrastructure.

4. Automation & Continuous Improvement

* Develop tools and scripts (Python/Go) to automate deployment, scaling, fault recovery, and other operational workflows.

* Build automated diagnostic toolchains to accelerate issue resolution and improve customer satisfaction.


Minimum qualifications:

- 3 years of experience in SRE, DevOps, or backend development, with expertise in distributed system operations. Experience in cloud computing, AI infrastructure, or Alibaba Cloud is a plus.

- Experience programming with at least one modern language such as Python, Golang, Java, C.

- Strong ability to work under pressure, manage critical incidents, and participate in an on-call rotation.

- Fluency in both Chinese and English for daily communication.


Preferred qualifications:

- Familiarity with MaaS or related knowledge.

- Deep knowledge of Linux systems, network protocols (TCP/HTTP), and databases, plus a deep understanding of cloud-native architecture design.

- Experience operating and maintaining large-scale container and Kubernetes clusters, with strong professional knowledge of cloud-native components (e.g., Prometheus, Istio, Calico).

- Extensive experience in building large-scale monitoring systems and utilizing them for in-depth analysis and operations.



The pay range for this position at commencement of employment is expected to be between $133,200/year and $219,600/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.


If hired, the employee will be in an “at-will position” and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.


Alibaba U.S. based full time regular employees have access to medical, dental, and vision insurance, a 401(k) plan and basic life insurance, and wellbeing benefits like FSA, subject to the terms and conditions of the applicable plans then in effect. U.S. based employees are also eligible to receive up to 12 paid holidays, accrue up to 15 paid vacation days for this position, and receive up to 72 hours paid sick time (front-loaded) per calendar year.

Salary: $133,200 - $219,600
