What are the responsibilities and job description for the Data Engineer - GCP position at CoreAi Consulting?
We are looking for a GCP Data Engineer with 5 years of experience to build scalable, reliable data pipelines across the full data lifecycle (ingestion, transformation, orchestration, and serving) using GCP services and Apache Spark.
The ideal candidate has strong big data fundamentals, hands-on PySpark experience, expertise in both batch processing and real-time streaming, and a focus on performance optimization and clean data modeling.
Key Responsibilities
- Design, build, and maintain batch and streaming pipelines using Dataflow, Dataproc, Pub/Sub, and Cloud Composer
- Develop scalable data transformations using PySpark and Spark SQL
- Implement data ingestion from databases, APIs, files, and event streams
- Build real-time streaming solutions with Pub/Sub and Dataflow (Apache Beam), including windowing, late-data handling, and watermarking (see the Beam sketch after this list)
- Design event-driven architectures for real-time analytics
- Orchestrate workflows using Cloud Composer (Airflow DAGs); a minimal DAG sketch follows this list
- Optimize Spark jobs by addressing data skew, shuffle issues, memory usage, and partitioning strategies (a salting sketch follows this list)
- Design and manage BigQuery schemas using dimensional modeling and lakehouse patterns
- Implement data quality, validation, and monitoring within pipelines (see the validation sketch after this list)
- Collaborate with stakeholders to translate business needs into data models
- Maintain documentation, runbooks, and follow Agile and CI/CD practices
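To give a flavor of the streaming work above, here is a minimal Apache Beam sketch of event-time windowing with a watermark trigger and a late-data allowance. The Pub/Sub topic path is a hypothetical placeholder, and the per-window count is purely illustrative.

```python
# A minimal sketch, not production code; the topic path is a placeholder.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger, window

def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/events"  # hypothetical topic
            )
            | "WindowInto" >> beam.WindowInto(
                window.FixedWindows(60),  # 1-minute event-time windows
                trigger=trigger.AfterWatermark(
                    late=trigger.AfterProcessingTime(30)  # re-fire for late data
                ),
                allowed_lateness=600,  # accept data up to 10 min past the watermark
                accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
            )
            | "KeyAll" >> beam.Map(lambda _msg: ("all", 1))
            | "CountPerWindow" >> beam.CombinePerKey(sum)
            | "Print" >> beam.Map(print)
        )

if __name__ == "__main__":
    run()
```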
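The orchestration responsibility could look like the following Cloud Composer (Airflow) DAG sketch; the DAG id, task callables, and schedule are illustrative assumptions, not an actual pipeline.

```python
# A minimal DAG sketch; task bodies and schedule are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from source systems")

def transform():
    print("run PySpark / SQL transformations")

def load():
    print("load curated tables into BigQuery")

with DAG(
    dag_id="example_daily_pipeline",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```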
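For the Spark optimization point, one common skew mitigation is key salting. This PySpark sketch spreads a hot join key across extra buckets; the dataframes and the bucket count are hypothetical.

```python
# A minimal PySpark sketch; the dataframes and bucket count are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting-demo").getOrCreate()

# A heavily skewed fact table: every row shares the same join key.
facts = spark.range(1_000_000).withColumn("customer_id", F.lit(42))
dims = spark.createDataFrame([(42, "ACME")], ["customer_id", "name"])

SALT_BUCKETS = 16

# Salt the skewed side: append a random bucket to the join key so the
# hot key's rows are spread across SALT_BUCKETS shuffle partitions.
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Explode the small side so every salt bucket finds a matching dim row.
salted_dims = dims.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
)

joined = salted_facts.join(salted_dims, ["customer_id", "salt"]).drop("salt")
print(joined.count())  # 1,000,000 rows, without one giant shuffle partition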
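And for in-pipeline data quality, a minimal PySpark sketch of null and duplicate checks that fail the task when thresholds are breached; the sample data, columns, and thresholds are assumptions for illustration.

```python
# A minimal sketch; the sample data and thresholds are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks-demo").getOrCreate()

df = spark.createDataFrame(
    [("o1", "c1", 10.0), ("o2", None, 5.0), ("o2", "c2", 5.0)],
    ["order_id", "customer_id", "amount"],
)

total = df.count()
null_keys = df.filter(F.col("customer_id").isNull()).count()
duplicates = total - df.dropDuplicates(["order_id"]).count()

# Fail the task loudly if the batch breaches agreed thresholds, so the
# orchestrator (e.g. Composer) can alert and stop downstream loads.
# (This sample data intentionally trips both checks.)
if null_keys / total > 0.01:
    raise ValueError(f"{null_keys}/{total} rows missing customer_id")
if duplicates > 0:
    raise ValueError(f"{duplicates} duplicate order_id values found")
```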
Qualifications and Required Skills
- 5 years of experience in Data Engineering with GCP, Python, PySpark, and SQL
- Strong expertise in BigQuery (advanced SQL, partitioning, clustering, cost optimization); see the table-creation sketch after this list
- Experience with Dataflow (Apache Beam) for batch and streaming pipelines
- Hands-on experience with Dataproc / Apache Spark (PySpark, Spark SQL, performance tuning)
- Experience with Pub/Sub (event design, delivery semantics, deduplication)
- Experience with Cloud Composer (Airflow) for workflow orchestration
- Experience with Cloud Storage integration and lifecycle management
- Strong understanding of distributed data processing concepts (partitioning, shuffling, fault tolerance)
- Solid understanding of Spark internals (execution model, DAGs, Catalyst optimizer, Spark UI debugging)
- Familiarity with data formats such as Parquet, ORC, Avro, and Delta Lake
- Knowledge of streaming concepts (windowing, triggers, exactly-once vs at-least-once processing)
- Experience with data modeling (star/snowflake schemas, SCD Type 1/2); an SCD Type 2 MERGE sketch follows this list
- Experience with lakehouse architectures (Delta Lake/Iceberg with BigQuery)
- Experience with incremental data loads and large-scale partitioned datasets (see the watermark sketch after this list)
- Experience with performance tuning (joins, partitioning, caching, file sizing)
- Familiarity with CI/CD pipelines (Cloud Build or GitHub Actions)
- Experience using Git for version control
- Basic knowledge of Terraform for infrastructure management
- Experience building Python-based APIs (e.g., FastAPI) for data services (see the FastAPI sketch after this list)
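To illustrate the BigQuery partitioning and clustering expectation, a minimal sketch using the google-cloud-bigquery client; the project, dataset, table, and schema are hypothetical.

```python
# A minimal sketch; project, dataset, table, and schema are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.analytics.events",
    schema=[
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
)

# Partition by day on event_ts so date-filtered queries prune partitions...
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
)
# ...and cluster by customer_id to cut bytes scanned on keyed lookups.
table.clustering_fields = ["customer_id"]

client.create_table(table, exists_ok=True)
```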
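For the SCD Type 2 requirement, a minimal sketch of a Type 2 upsert as a BigQuery MERGE issued from Python. Table and column names are hypothetical, and the re-insert of the new version for changed rows (usually done via a staging union) is omitted for brevity.

```python
# A minimal sketch; tables and columns are hypothetical, and the re-insert
# of the new version for changed rows is omitted for brevity.
from google.cloud import bigquery

client = bigquery.Client()

scd2_close_out = """
MERGE `my-project.dw.dim_customer` AS tgt
USING `my-project.staging.customer_updates` AS src
ON tgt.customer_id = src.customer_id AND tgt.is_current
WHEN MATCHED AND tgt.address != src.address THEN
  -- Close out the current version of a changed row.
  UPDATE SET is_current = FALSE, valid_to = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
  -- Brand-new customer: insert as the open, current version.
  INSERT (customer_id, address, valid_from, valid_to, is_current)
  VALUES (src.customer_id, src.address, CURRENT_TIMESTAMP(), NULL, TRUE)
"""
client.query(scd2_close_out).result()
```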
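The incremental-load skill usually reduces to a watermark pattern like this PySpark sketch; the GCS paths, column names, and hard-coded watermark are placeholders (a real job would read and persist the watermark in a metadata store).

```python
# A minimal sketch; paths, columns, and the hard-coded watermark are
# placeholders; a real job reads/persists the watermark in a metadata store.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-load-demo").getOrCreate()

# High-water mark from the previous successful run (assumed value).
last_watermark = "2024-01-01 00:00:00"

incoming = (
    spark.read.parquet("gs://my-bucket/raw/orders/")  # hypothetical path
    .filter(F.col("updated_at") > F.to_timestamp(F.lit(last_watermark)))
)

# Append only the new slice, partitioned by date for cheap pruning later.
(
    incoming.withColumn("load_date", F.to_date("updated_at"))
    .write.mode("append")
    .partitionBy("load_date")
    .parquet("gs://my-bucket/curated/orders/")  # hypothetical path
)

# Compute the next high-water mark (persisting it is left out here).
next_watermark = incoming.agg(F.max("updated_at")).first()[0]
print("next watermark:", next_watermark)
```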
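Finally, for Python-based data APIs, a minimal FastAPI sketch serving query results from BigQuery; the route, table, and columns are assumptions for illustration.

```python
# A minimal sketch; the route, table, and columns are hypothetical.
from fastapi import FastAPI, HTTPException
from google.cloud import bigquery

app = FastAPI()
client = bigquery.Client()

@app.get("/customers/{customer_id}/orders")
def get_orders(customer_id: str, limit: int = 100):
    # Parameterized query: never interpolate user input into SQL strings.
    query = """
        SELECT order_id, order_ts, amount
        FROM `my-project.dw.fct_orders`
        WHERE customer_id = @customer_id
        ORDER BY order_ts DESC
        LIMIT @limit
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("customer_id", "STRING", customer_id),
            bigquery.ScalarQueryParameter("limit", "INT64", limit),
        ]
    )
    rows = [dict(row) for row in client.query(query, job_config=job_config).result()]
    if not rows:
        raise HTTPException(status_code=404, detail="no orders for this customer")
    return rows
```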