What are the responsibilities and job description for the SRE/MQ Infrastructure Lead position at Stash Talent Services?
Position: SRE/MQ Infrastructure Lead
Location: Plano, TX (3 days in office, 2 days remote)
Duration: 12-month contract (extendable up to 36 months)
Job Details:
We are seeking an experienced Site Reliability Engineer (SRE) / MQ Infrastructure Lead for Messaging Services to drive platform reliability, observability, and operational excellence across IBM MQ and Kafka environments. This role combines production engineering and reliability leadership, platform security and resilience engineering, and ownership of large-scale, distributed messaging runtimes. The position has a hybrid schedule requiring a minimum of 3 days per week on-site.
Responsibilities:
- Leading reliability engineering for high-scale messaging platforms supporting tens of thousands of runtimes and high-volume message throughput
- Driving EOL remediation, patching, and stabilization across MQ queue managers and Kafka clusters
- Implementing SRE best practices, including SLIs/SLOs focused on message delivery, latency, and availability, as well as incident management, escalation, and a postmortem culture
- Enhancing observability and monitoring for messaging flows, queue depths, lag, and throughput
- Designing proactive fault detection and auto-remediation strategies (e.g., DLQ handling, backlog mitigation, failover recovery)
- Building resilient messaging platforms capable of supporting real-time, event-driven workloads
- Supporting global production messaging environments with on-call rotation and escalation ownership
- Partnering with engineering, application, and security teams to ensure reliability, scalability, and secure message transport
Requirements:
- Strong experience in Site Reliability Engineering / Production Engineering
- Hands-on expertise with IBM MQ (queue managers, clustering, channels, DLQ management); Kafka / Confluent Platform (topics, brokers, partitions, consumer groups); and large-scale distributed messaging systems and runtime management
- Deep understanding of system reliability, scalability, and high availability design; messaging reliability patterns (guaranteed delivery, retry handling, replay, ordering); and incident management, root cause analysis, and problem management
- Experience with observability tools (Dynatrace, Splunk, Prometheus, Grafana) for monitoring messaging platforms and for event and anomaly detection in high-volume systems
- Strong scripting/automation skills in Shell, Python, PowerShell
- Experience managing Linux/Unix and Windows production environments
- Knowledge of event-driven architecture and messaging-based integration patterns
- Understanding of messaging platform security (TLS, certificates, channel auth, encryption), as well as vulnerability remediation and risk mitigation in production systems
- Excellent troubleshooting skills in high-pressure, real-time environments (e.g., message backlog, latency spikes, connection failures)
Desired skills:
- Experience implementing SRE frameworks (SLIs, SLOs, error budgets) specifically for messaging workloads
- Familiarity with Kubernetes / containerized messaging platforms
- Experience with Kafka ecosystem components (Schema Registry, Connect, Streams) and IBM MQ advanced features (Native HA, clustering)
- Exposure to AI-driven operations (AIOps), anomaly detection, or automated remediation, and to large-scale messaging modernization or migration programs
- Messaging or middleware certifications (IBM MQ, Kafka, or equivalent)
- Experience in regulated environments (e.g., financial services)
Salary: $60