Data Engineer

Aldea
San Francisco, CA Full Time
POSTED ON 11/27/2025 CLOSED ON 12/26/2025

What are the responsibilities and job description for the Data Engineer position at Aldea?

About Aldea

Aldea is a multi-modal foundational AI company reimagining the scaling laws of intelligence. We believe today's architectures create unnecessary bottlenecks for the evolution of software. Our mission is to build the next generation of foundational models that power a more expressive, contextual, and intelligent human–machine interface.

The Role

We are hiring a Data Engineer to build the data infrastructure that powers Aldea's multi-modal AI research. You will design and scale data pipelines for pretraining, midtraining, and post-training at trillion-token scale, process diverse data sources across language and speech domains, and generate high-quality synthetic data for model training.

This is a high-impact role where your work directly determines training quality and efficiency. If you're passionate about building data systems that power cutting-edge AI research, this role is for you.

What You'll Do

  • Build and scale data pipelines for pretraining, midtraining, and post-training at trillion-token scale across language and speech domains
  • Process and curate large-scale datasets including cleaning, deduplication, quality filtering, and optimization for distributed training
  • Generate synthetic data for model training and evaluation across diverse tasks and domains
  • Design efficient data loading systems achieving high throughput across multi-node training clusters
  • Build data versioning and reproducibility systems to track dataset compositions and enable reproducible experiments
  • Collaborate with ML engineers and researchers to optimize pipelines and improve data quality

Minimum Qualifications

  • Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
  • 3 years of experience building large-scale data pipelines for machine learning or data-intensive applications
  • Strong programming skills in Python and experience with data processing frameworks (Spark, Dask, Ray, or similar)
  • Experience with data quality techniques including deduplication, filtering, and validation at scale
  • Proven ability to optimize data pipelines for performance and throughput in distributed systems
  • Experience working with large datasets (100GB-10TB) and understanding of storage systems and data formats

Preferred Qualifications

  • Experience building data pipelines for LLM pretraining or large-scale ML training
  • Hands-on experience with synthetic data generation for language or speech models
  • Experience with text processing at scale: tokenization, deduplication (MinHash, LSH), and quality assessment
  • Familiarity with audio/speech data processing and dataset curation
  • Knowledge of data contamination detection and dataset versioning best practices
  • Experience optimizing data loaders for PyTorch or TensorFlow at scale
  • Understanding of distributed storage systems (S3, GCS, HDFS) and data streaming patterns

Compensation & Benefits

  • Competitive base salary
  • Performance-based bonus aligned with research and model milestones
  • Equity participation
  • Comprehensive health, dental, and vision coverage
  • Flexible paid time off

Aldea is proud to be an equal-opportunity employer. We are committed to building a diverse and inclusive culture that celebrates authenticity to win as one. We do not discriminate on the basis of race, religion, color, national origin, gender, gender identity, sexual orientation, age, marital status, disability, protected veteran status, citizenship or immigration status, or any other legally protected characteristics.

Aldea uses E-Verify to confirm employment eligibility in compliance with federal law. For more information please visit: https://www.e-verify.gov.

Please note: We do not accept unsolicited resumes from recruiters or employment agencies and will not be responsible for any fees related to unsolicited resumes.
Salary.com Estimation for Data Engineer in San Francisco, CA
$118,453 to $148,813

This job has expired.



