What are the responsibilities and job description for the DevOps & Site Reliability Lead position at iPeople Infosystems LLC?
Job Title: DevOps & Site Reliability Lead
Location: Deerfield, IL
Type: Full-time position
Job description:
Must Have Technical/Functional Skills
- We are seeking a Site Reliability Engineer (SRE) with strong expertise in Talend and Big Data platforms to support and operate large-scale data processing environments.
- The role requires close collaboration with customers, application teams, and offshore delivery teams to ensure platform reliability, incident management, and operational excellence. Experience with Databricks is a strong plus.
Key Responsibilities
- Act as an SRE for Big Data and ETL platforms, ensuring high availability, performance, and reliability of data pipelines and applications.
- Provide operational support and Major Incident Management (MIM), including triage, root cause analysis, and resolution of production issues.
- Serve as a primary point of contact for customers, providing timely updates, issue resolution, and operational insights.
- Collaborate closely with application teams to support ETL jobs, data processing workflows, and platform enhancements.
- Coordinate with offshore teams for day-to-day operations, incident resolution, and continuous improvement initiatives.
- Monitor, troubleshoot, and optimize Talend, Hadoop, Spark, and Big Data ecosystems.
- Implement and support monitoring, alerting, runbooks, and automation to improve platform stability and reduce manual effort.
- Participate in problem management, change management, and post-incident reviews to drive preventive measures.
- Support capacity planning, performance tuning, and reliability improvements across the data landscape.
Required Skills & Qualifications
- Strong hands-on experience with Talend (development, support, and troubleshooting).
- Solid understanding of Big Data technologies, including:
o Hadoop ecosystem
o Apache Spark
- Proven experience handling Major Incident Management (MIM) and production support in a 24x7 or on-call environment.
- Experience working directly with customers, business stakeholders, and cross-functional teams.
- Strong coordination skills to manage and guide offshore teams.
- Knowledge of ITIL processes, especially Incident, Problem, and Change Management.
- Excellent communication, documentation, and stakeholder management skills.