What are the responsibilities and job description for the Sr. Site Reliability Engineer position at Masterapp Labs?
Job Title: Sr. Site Reliability Engineer
Location: Austin, TX (Hybrid): 3 days remote, with 2 days (Mondays and Thursdays) required onsite
Position Type: Contract
Interview Mode: Both MS Teams & In-person
Note: The program will accept only LOCAL candidates for this position.
Job Description:
We are seeking an experienced Site Reliability Engineer to ensure the reliability, availability, performance, and scalability of production systems. The ideal candidate will apply software engineering principles to infrastructure and operations while working closely with development teams to build resilient, observable, and automated platforms aligned with defined service level objectives (SLOs).
Key Responsibilities:
- Understand business objectives, identify problems, and propose effective solutions
- Perform studies and cost-benefit analysis of alternative approaches
- Analyze user requirements, procedures, and existing systems to improve or automate processes
- Collaborate with cross-functional teams to understand operational workflows and system needs
- Document detailed requirements, system functions, and development steps
- Evaluate system capabilities, limitations, and feasibility of enhancements
- Independently handle complex tasks with a high degree of creativity and judgment
- Ensure system reliability, scalability, and performance through engineering best practices
Requirements:
- Strong experience in handling complex technical tasks independently
- Proven ability to analyze and improve system processes
- Excellent problem-solving and analytical skills
- Ability to work with stakeholders to gather and define requirements
- Experience in building and maintaining reliable and scalable systems
II. CANDIDATE SKILLS AND QUALIFICATIONS
Minimum Requirements:
Candidates who do not meet or exceed the minimum stated requirements (skills/experience) will be displayed to customers but may not be chosen for this opportunity.

| Years | Required/Preferred | Experience |
|---|---|---|
| 8 | Required | Experience in systems engineering, DevOps, or site reliability engineering roles |
| 8 | Required | Strong experience with Linux/Unix systems and system internals |
| 8 | Required | Proficiency in one or more programming/scripting languages (Python, Go, Java, Bash) |
| 8 | Required | Experience designing and operating highly available, distributed systems |
| 8 | Required | Strong knowledge of cloud platforms (AWS or Google Cloud Platform) and cloud-native services |
| 8 | Required | Experience with containerization and orchestration (Docker, Kubernetes) |
| 8 | Required | Strong understanding of monitoring, alerting, and logging concepts |
| 8 | Required | Experience defining and managing SLIs, SLOs, and error budgets |
| 8 | Required | Familiarity with incident management, root cause analysis (RCA), and postmortems |
| 8 | Required | Experience integrating security and compliance into operational workflows |
| 4 | Preferred | Familiarity with observability tools (Prometheus, Grafana, Application Insights, Datadog, Splunk) |
| 4 | Preferred | Experience operating 24x7 production environments with on-call rotations |
| 4 | Preferred | Experience with chaos engineering and resiliency testing |
| 4 | Preferred | Experience with feature flags, canary deployments, and progressive delivery |
| 4 | Preferred | Strong documentation skills for runbooks, dashboards, and operational standards |