What are the responsibilities and job description for the Site Reliability Engineer position at Blankfactor?
This is a full-time position supporting the financial services/payments space. It is fully onsite 5 days per week in Berkeley Heights, NJ, with some on-call support on a rotation basis. Please apply only if you have relevant experience and a valid work authorization. Unfortunately, we cannot work via C2C or C2H; this is a W2 position.
About Blankfactor
At Blankfactor, we are dedicated to engineering impact. We build high-quality tech solutions for companies looking to innovate and grow—especially in fast-moving industries like payments, banking, capital markets, and life sciences.
About the Role
As a Site Reliability Engineer, you will ensure the reliability, availability, and performance of mission-critical platforms by building scalable systems, robust automation, and data-driven operations. You will partner closely with development, cloud, infrastructure, and security teams to deliver resilient, high-performing services that support the way people live and work today.
Responsibilities
- Design and implement solutions that enhance application reliability, performance, scalability, and resilience.
- Build and maintain monitoring, alerting, observability, and telemetry to drive proactive detection and rapid incident response.
- Lead incident management efforts, perform root cause analysis, and implement action-oriented post-mortem improvements.
- Automate operational workflows using scripting, IaC, and configuration management tools.
- Analyze capacity, performance, and usage trends to forecast demand and optimize cloud costs.
- Collaborate with engineering teams to embed operability, resilience, and security into application and architecture designs.
- Support safe, reliable deployments through CI/CD pipelines, release governance, and change control.
- Maintain clear runbooks, architecture diagrams, and operational documentation that enable efficient production support.
Required Qualifications
- Managing Kubernetes and containerized workloads (EKS, AKS, GKE), including scaling, networking, upgrades, and orchestration.
- Experience with public cloud platforms (AWS, Azure, or GCP) across compute, storage, networking, IAM, and cost governance.
- Using observability and APM tools such as Dynatrace, Splunk, Prometheus, Grafana, Datadog, and ExtraHop.
- Implementing security and compliance controls in regulated environments (e.g., PCI DSS, SOC 2), including secrets management and vulnerability remediation.
- Infrastructure as Code experience using Terraform, CloudFormation, Ansible, or similar tools.
- Designing and maintaining CI/CD pipelines using Jenkins, GitLab CI, GitHub Actions, or Azure DevOps.
- Scripting and automation using Bash, PowerShell, or Python.
- Equivalent combination of education, experience, and/or military background.
Nice to Have
- Certifications such as AWS SysOps Administrator, AWS DevOps Engineer, Google Cloud DevOps Engineer, or CKA.
- Experience with Premier applications, IBM iSeries, and/or Unisys systems.
- Hands-on database operations and performance tuning (Oracle, SQL Server, PostgreSQL).
- Proven experience in major incident command, stakeholder communication, and cross-team coordination.
- Experience with ITIL and ServiceNow (change, problem, and configuration management).