What are the responsibilities and job description for the Senior Site Reliability Engineer position at Jobs via Dice?
Responsibilities:
Requires 8 or more years of experience. Relies on experience and judgment to plan and accomplish goals, and independently performs a variety of complicated tasks. A wide degree of creativity and latitude is expected.
Understands business objectives and problems, identifies alternative solutions, and performs studies and cost/benefit analyses of alternatives. Analyzes user requirements, procedures, and problems to automate processing or to improve existing computer systems. Confers with personnel of the organizational units involved to analyze current operational procedures, identify problems, and learn specific input and output requirements, such as forms of data input, how data is to be summarized, and formats for reports. Writes detailed descriptions of user needs, program functions, and the steps required to develop or modify computer programs. Reviews computer system capabilities, specifications, and scheduling limitations to determine whether a requested program or program change is possible within the existing system.
The Site Reliability Engineer is responsible for ensuring the reliability, availability, performance, and scalability of production systems by applying software engineering practices to infrastructure and operations. Partners with development teams to build resilient, observable, and automated platforms that meet defined service level objectives (SLOs).
Qualifications:
Minimum Requirements:
Candidates who do not meet or exceed the minimum stated requirements (skills/experience) will be displayed to customers but may not be chosen for this opportunity.
Years  Required/Preferred  Experience
8      Required            Experience in systems engineering, DevOps, or site reliability engineering roles
8      Required            Strong experience with Linux/Unix systems and system internals
8      Required            Proficiency in one or more programming/scripting languages (Python, Go, Java, Bash)
8      Required            Experience designing and operating highly available, distributed systems
8      Required            Strong knowledge of cloud platforms (AWS or Google Cloud Platform) and cloud-native services
8      Required            Experience with containerization and orchestration (Docker, Kubernetes)
8      Required            Strong understanding of monitoring, alerting, and logging concepts
8      Required            Experience defining and managing SLIs, SLOs, and error budgets
8      Required            Familiarity with incident management, root cause analysis (RCA), and postmortems
8      Required            Experience integrating security and compliance into operational workflows
4      Preferred           Familiarity with observability tools (Prometheus, Grafana, Application Insights, Datadog, Splunk)
4      Preferred           Experience operating 24x7 production environments with on-call rotations
4      Preferred           Experience with chaos engineering and resiliency testing
4      Preferred           Experience with feature flags, canary deployments, and progressive delivery
4      Preferred           Strong documentation skills for runbooks, dashboards, and operational standards