What are the responsibilities and job description for the Data Engineer position at Avance Consulting?
Data Engineer (Hadoop & On-Premise Cloud)
Charlotte, North Carolina | Plano, Texas (Onsite)
Permanent/Full Time
Required Skills and Experience (Must-Have)
End-to-End Implementation: Proven experience in end-to-end project implementation using Hadoop (HDFS, Hive, HBase), PySpark, and On-Premise Cloud/Private Cloud infrastructures (e.g., OpenStack, VMware, or bare-metal clusters).
Core Hadoop Ecosystem: Strong knowledge and hands-on experience in HDFS, Hive, YARN, MapReduce, and HBase.
PySpark Development: Hands-on experience designing and developing ingestion pipelines, ETL pipelines, and batch/streaming jobs using PySpark (must-have; Scala Spark is a plus but not required).
On-Premise Cluster Management: Experience in on-premise cluster configuration, resource management (CPU/memory), NameNode/DataNode administration, and job scheduling using Apache Airflow, Oozie, or Control-M.
Performance Tuning: Proven ability to handle large datasets (TB/PB scale) and apply performance optimization techniques for data ingestion and retrieval in on-premise Hadoop environments (e.g., partitioning, bucketing, and choosing file formats such as Parquet or ORC).
Preferred Skills and Experience (Good-to-Have)
Software Engineering Best Practices: Knowledge of design patterns, data structures, algorithms, collections, multi-threading, memory management, and concurrency in Python.
Data Warehousing & SQL: Strong SQL skills for Hive/Impala queries and experience with traditional on-premise data warehouses.
Migration Experience: Experience migrating data from legacy on-premise systems (e.g., Oracle, Teradata, Netezza) to modern Hadoop-based data lakes on private cloud.
Workflow Management: Hands-on experience with workflow orchestration tools like Apache Airflow, NiFi, or Oozie for complex dependency management.
Visualization: Exposure to Power BI, Tableau, or QlikView connecting to Hadoop/Hive.
Agile Methodologies: Understanding of Scrum, Kanban, or other Agile frameworks.
Domain Knowledge: Experience in the Banking, Financial Services, and Insurance (BFSI) domain, including familiarity with regulatory reporting, data governance, or compliance (GDPR, BCBS, etc.).
Team Collaboration: Ability to work effectively in a diverse, multi-stakeholder environment comprising Business users, Data Scientists, and IT infrastructure teams.
Key Responsibilities
Design, build, and maintain scalable data pipelines using PySpark on on-premise Hadoop clusters.
Manage and optimize Hive metastore, HDFS storage, and YARN resource queues for multi-tenant workloads.
Implement data validation, error handling, and reconciliation mechanisms for batch and real-time data.
Collaborate with infrastructure teams to tune on-premise cloud resources (compute, storage, network) for Spark workloads.
Migrate existing ETL workflows from legacy systems to the Hadoop ecosystem.
Salary: $100,000