What are the responsibilities and job description for the Site Reliability Engineer position at Rapsys Technologies?

Proven experience in high-availability, high-transaction environments (preferably payments or financial services).
Strong background in production resiliency and recovery (recovery execution, runbooks/playbooks, RCA mindset).
Incident pattern analysis: MTTR baselines (P2 Major/Minor) and a recurring-failure taxonomy (by rail/service).
Senior-level observability expertise: dashboards, monitors, and alerts (Datadog preferred; similar tools considered).
Tools: Splunk, Datadog, SQLS, JQL (Jira Query Language), GitLab.
Experience with CI/CD metrics and with generating executive reports from GitLab on code quality, changes, and test automation.
Understanding of story quality, metrics, and monitoring, with the ability to gather data that showcases deficiencies.
Senior CI/CD experience: pipeline design/operation, release safety patterns, and rollback readiness.
Experience using metrics and monitoring data to identify and communicate deficiencies.
Automation skills: Python and/or PowerShell (or equivalent) for building repeatable recovery workflows and operational tooling.
Kubernetes/container platform production troubleshooting (deployments, pods, config drift, safe restarts, and "why did this change break prod" investigations).
Experience with identity/credentials/certificate & secret-rotation resilience (preventing outages during password rotations, certificate upgrades, and secret propagation; implementing guardrails and monitoring for these events).
Batch/scheduler/job-execution reliability (detecting/preventing silent job failures, validating multi-D scenarios, and building controls to ensure scheduled processing does not impact customers).
Distributed integration failure-handling (timeouts, retries, backpressure, idempotency, duplicate prevention, and reconciliation, especially across vendor/downstream dependencies).
Nice-to-have (differentiators)
Experience with SRE-style reliability practices (SLO/SLI thinking, error budgets, operational metrics).
Experience with failover / DC flip / active-active or active-passive recovery concepts and scenario-based runbooks.
Cloud engineering (Azure, AWS) and DevOps tools expertise (Jenkins, Terraform, SonarQube, Helm charts).
Network and traffic-management incident triage (load balancers/firewalls/VLAN changes, DC traffic flips, and rapid isolation of "app vs infra vs network" to stabilize service).

Salary: $50 - $55
