What are the responsibilities and job description for the Site Reliability Engineer position at Matlen Silver?

The Site Reliability Engineer (SRE) Lead is responsible for ensuring the reliability, scalability, performance, and security of enterprise Linux-based systems. This role combines deep technical expertise in Linux administration with leadership in automation, observability, incident management, and infrastructure engineering. The SRE Lead drives operational excellence by implementing best practices, improving system resilience, and mentoring engineering teams.

Platform Engineering & Operations

Lead the administration, monitoring, and performance tuning of Oracle Enterprise Linux (OEL) environments in a large-scale enterprise ecosystem.

Oversee the design, build, and lifecycle management of Linux servers, including storage, virtualization, and associated infrastructure.

Manage high availability (HA) configurations, clustering, and load-balanced environments to ensure minimal downtime.

Drive capacity planning, performance optimization, and system scalability initiatives.

Reliability & Automation (SRE Practices)

Define and implement SRE principles, including SLIs, SLOs, and error budgets.

Lead initiatives for infrastructure automation (provisioning, configuration, patching) using tools such as Ansible.

Build and maintain self-healing systems, reducing manual intervention and improving system resilience.

Develop automation for system installation, configuration, and deployment pipelines.

System Administration & Infrastructure Management

Install, configure, and maintain Oracle Enterprise Linux (OEL) operating systems and related software stacks.

Manage Logical Volume Manager (LVM) configurations, including volume groups and filesystem expansion.

Administer distributed file systems, NFS servers/clients, and automount configurations.

Maintain network services such as DNS, NTP, LDAP/Kerberos, SMTP (sendmail/postfix), and OpenSSH.

Troubleshoot and support network protocols (TCP/IP, HTTP, HTTPS, RPC).

Monitoring, Incident Management & Support

Implement and enhance monitoring, alerting, and observability frameworks for proactive issue detection.

Lead incident response, root cause analysis (RCA), and postmortem reviews.

Drive continuous improvement by identifying systemic issues and implementing preventive solutions.

Oversee break/fix operations, ensuring timely resolution and minimal business impact.

Security & Compliance

Ensure systems are secure, hardened, and compliant with enterprise security standards.

Manage patching, vulnerability remediation, and OS upgrades.

Partner with security teams to implement best practices for access control, auditing, and encryption.

Leadership & Collaboration

Provide technical leadership and mentorship to SRE and infrastructure teams.

Collaborate with application, DevOps, and platform teams to improve system reliability and deployment processes.

Define and enforce operational standards, runbooks, and best practices.

Drive cross-functional initiatives to enhance platform stability and efficiency.

Documentation & Governance

Maintain comprehensive documentation for architecture, processes, and operational procedures.

Ensure adherence to change management and incident governance frameworks.

Standardize operational workflows across environments.

Required Qualifications

At least 5 years of experience in Linux system administration in enterprise environments.

Strong expertise in Oracle Enterprise Linux (OEL) systems.

Proven experience in high availability systems, virtualization, and storage management.

Hands-on experience with automation and configuration management tools (Ansible preferred).

Proficiency in at least one scripting/programming language (Bash, Python preferred).

Strong experience with infrastructure troubleshooting, performance tuning, and incident management.

Solid understanding of enterprise infrastructure (compute, storage, network).

Excellent analytical, problem-solving, and organizational skills.

Strong communication and collaboration skills in a global team environment.
