What are the responsibilities and job description for the Principal Kafka Support & Reliability Engineer position at Purple Drive Technologies LLC?
Role: Principal Kafka Support & Reliability Engineer
Location: Canton, MA
Role Descriptions:

1. Tier 3 Incident Management & Escalation Support
- Act as the highest technical escalation point for Kafka production incidents (Sev 1 / Sev 2).
- Lead deep troubleshooting across:
  - Broker instability, controller elections, ISR shrinkage
  - Under-replicated partitions and leader imbalance
  - Producer/consumer failures, lag spikes, and rebalance storms
  - Disk, network, JVM, and request handler saturation
- Provide hands-on remediation for complex issues, including:
  - Partition reassignment and leader rebalance
  - Broker configuration tuning
  - Throttle/quota strategies for noisy producers or consumers
- Coordinate with vendor support during service incidents, providing logs, metrics, and forensic details.
- Guide Tier 2 teams during major incidents and validate restoration actions.

2. Kafka Performance Engineering & Optimization
- Analyze Kafka workloads for performance and scalability risks:
  - Partition skew and hot partitions
  - Inefficient producer batching/compression
  - Consumer lag root cause analysis
  - Thread pool, I/O, and network bottlenecks
- Recommend and validate:
  - Topic design (partition count, replication factor, retention, compaction)
  - Producer and consumer configuration best practices
  - Quotas, quota enforcement, and multi-tenant controls
- Support onboarding of high-throughput or latency-sensitive workloads, ensuring Kafka is correctly sized and tuned.

3. Platform Stability, Reliability & Resilience
- Diagnose and resolve systemic Kafka stability issues:
  - Repeated broker failures or flapping
  - Metadata/controller instability (ZooKeeper or KRaft)
  - Recovery issues following failovers or maintenance events
- Support resilience initiatives:
  - Multi-AZ cluster health validation
  - Replication and DR strategies (MirrorMaker 2, Replicator, or app-level DR patterns)
  - Failover testing and validation
- Define and improve Kafka SLOs for availability, durability, and latency.

4. Change, Upgrade & Configuration Leadership
- Lead medium- to high-risk Kafka changes, including:
  - Broker and cluster configuration changes
  - Partition expansion or large-scale reassignment
  - Topic policy changes impacting durability or performance
- Support and plan Kafka version upgrades:
  - MSK / Confluent upgrade cycles
  - Client compatibility and rollout strategies
- Participate in CAB reviews, assess risk, and design rollback and validation plans.

5. Root Cause Analysis & Continuous Improvement
- Own RCA documentation for major incidents with clear corrective and preventive actions (CAPA).
- Identify recurring failure patterns and architectural gaps.
- Recommend platform-level improvements:
  - Automation opportunities
  - Guardrails and standards
  - Monitoring and alerting enhancements
- Contribute to continuous improvement of runbooks, knowledge base articles, and operational playbooks.
Essential Skills:

Role Overview
The Kafka Tier 3 Support Engineer is a senior technical role responsible for expert-level support, advanced troubleshooting, performance engineering, and platform stabilization of enterprise Apache Kafka environments. This role functions as the final technical escalation point for Kafka-related production incidents and is accountable for root cause analysis (RCA), complex remediation, and long-term prevention. The engineer works closely with Tier 2 operations, Platform Engineering, SRE teams, application teams, and vendor support (AWS MSK / Confluent Cloud providers) to ensure Kafka remains a highly reliable, scalable, and secure streaming backbone.
Desirable Skills:
Keyword:
Skills: Digital: Kafka; Digital: Amazon Connect; Digital: Kubernetes
Experience Required: 10 years and above