What are the responsibilities and job description for the Site Reliability Engineer (89322-1) position at Jobs via Dice?

Dice is the leading career destination for tech experts at every stage of their careers. Our client, Key Business Solutions, Inc., is seeking the following. Apply via Dice today!

Site Reliability Engineer (89322-1)

Alpharetta, GA

12 Months

Role Description -

Skill Set - Expertise in UNIX/Linux administration, AWS/Azure cloud monitoring, and Terraform/Ansible, plus Prometheus/Grafana observability experience.

Work Location - Alpharetta

Experience required for role - 6 years

Production experience in SRE / Infrastructure / ops for large-scale systems
Strong programming/scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
Solid experience in capacity planning, performance tuning, scaling, and incident response
Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
Excellent communication, documentation, and cross-team collaboration skills
Proven track record of reducing operational toil via automation

Experience: 6 years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and working knowledge of networking and systems engineering.

Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
Design and build automation for core platform capabilities, reducing manual toil
Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
Optimize cost vs. performance tradeoffs in large-scale compute environments
Harden systems for security, compliance, auditability, and data governance
Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
Maintain runbooks, operational playbooks, documentation, and training materials
Participate in on-call rotations and respond to production incidents 24/7 as needed
Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Skills: Python, Docker, Kubernetes, Site Reliability Engineering (SRE)

Experience Required: 6-8 years

Skills:

Category: SkillCategoryTest1_MN
Name: Site Reliability Engineering (SRE)
Required: Yes
Importance: 1
Experience: 4-7 years
