What are the responsibilities and job description for the Site Reliability Engineer position at Rapsys Technologies?
Proven experience in high-availability, high-transaction environments (preferably payments or financial services).
Strong background in production resiliency and recovery (recovery execution, runbooks/playbooks, RCA mindset).
Incident pattern analysis: MTTR baselines (P2 Major/Minor) and recurring failure taxonomy (by rail/service).
Senior-level observability expertise: dashboards, monitors, and alerts (Datadog preferred; similar tools considered).
Splunk, Datadog, SQLS, JQL (Jira Query Language), GitLab.
Experience with CI/CD metrics and with generating executive reports on code quality, changes, and test automation from GitLab.
Understanding of story quality, and experience using metrics and monitoring data to identify and communicate deficiencies.
Senior CI/CD experience: pipeline design/operation, release safety patterns, and rollback readiness.
Automation skills: Python and/or PowerShell (or equivalent) for building repeatable recovery workflows and operational tooling.
Kubernetes/container platform production troubleshooting (deployments, pods, config drift, safe restarts, and "why did this change break prod" investigations).
Experience with identity/credentials/
Batch/scheduler/job-execution reliability (detecting/preventing silent job failures, validating multi-D scenarios, and building controls to ensure scheduled processing does not impact customers).
Distributed integration failure-handling (timeouts, retries, backpressure, idempotency, duplicate prevention, and reconciliation, especially across vendor/downstream dependencies).
Nice-to-have (differentiators)
Experience with SRE-style reliability practices (SLO/SLI thinking, error budgets, operational metrics).
Experience with failover / DC flip / active-active or active-passive recovery concepts and scenario-based runbooks.
Cloud engineering (Azure, AWS).
DevOps tools expertise (Jenkins, Terraform, SonarQube, Helm charts).
Network and traffic-management incident triage (load balancers/firewalls/VLAN changes, DC traffic flips, and rapid isolation of "app vs infra vs network" to stabilize service).
Salary: $50 - $55