
Data Engineer (Web Data)

Abaka AI
Palo Alto, CA | Full Time
POSTED ON 12/4/2025
AVAILABLE BEFORE 2/3/2026
About Abaka AI
 
Abaka AI is built on one mission: to be the world’s most trusted data partner for AI companies. More than 1,000 industry leaders across Generative AI, Embodied AI, and Automotive AI rely on us to power their data pipelines. With our headquarters in Silicon Valley—and teams in Paris, Singapore, and Tokyo—we support global partners with fast, reliable, and scalable data solutions.
Our offerings include a diverse catalog of off-the-shelf datasets (image, video, multimodal, reasoning, 3D, and beyond) as well as comprehensive data collection and annotation services. Whether teams need raw data, curated datasets, or full-cycle data engineering, Abaka AI provides the foundation for building high-performance AI systems.
 
 
About the Role
 
We’re hiring a Data Engineer (Web Data) focused on Web Crawling in the United States, a foundational role that will shape how Abaka AI acquires high-quality web-scale data to power multimodal AI systems. You’ll design, build, and maintain robust crawling infrastructure that supports large-scale data collection across diverse domains and formats.
This role blends low-level system design with real-world operational problem-solving. You’ll work closely with data engineering and research teams to define crawling targets, implement anti-bot resilient architectures, manage proxies, and transform raw web content into structured datasets optimized for AI training and evaluation.
As an early technical hire, you’ll play a key role in setting standards for reliability, scalability, and data quality across our web data pipelines. If you’re excited about building distributed systems, solving complex scraping challenges, and enabling the next generation of frontier AI models, this role offers the opportunity to make a lasting impact.
 
 

Responsibilities

  • Collaborate closely with clients to understand their data requirements, and coordinate internal teams to create tailored delivery plans that ensure on-time, high-quality data delivery, including meeting expectations for format, precision, and volume.
  • Lead the development of mid- to long-term plans for the data engineering function. Build scalable, end-to-end pipelines for multimodal data (text, image, audio, video, 3D point cloud, etc.), covering data sourcing, cleaning, annotation, QA, storage, and iterative optimization for training, fine-tuning, and evaluation.
  • Develop solutions to core technical challenges in multimodal data processing, such as cross-modal alignment (for example, image-text semantic matching), large-scale data cleaning (deduplication, denoising, format normalization), annotation efficiency, and data encryption and security.
  • Work cross-functionally with algorithm, product, and business teams by providing feedback to model teams on data bottlenecks, helping refine internal tools and services, and supporting client-facing teams with technical documentation and pre-sales materials.
  • Evaluate and optimize the cost structure of data processing operations, including headcount, infrastructure, and tooling, to balance quality, efficiency, and scalability.
 

Qualifications

  • Strong background in computer science, data engineering, artificial intelligence, or related fields, with hands-on experience working with large-scale data systems.
  • 3 years of experience in data engineering or data operations. Leadership experience is highly valued, and prior involvement in LLM or multimodal dataset preparation is a strong plus.
  • Must-have technical skills: Strong Python proficiency; HTML/DOM parsing (lxml, XPath); HTTP internals; advanced Scrapy; async crawling (aiohttp/asyncio); Playwright/Selenium; familiarity with browser internals.
  • Deep understanding of end-to-end multimodal data workflows, with practical experience in at least two modalities, such as text, images, audio, or video.
  • Proficiency in designing technical architectures for large-scale data pipelines, including distributed processing and automation frameworks. Familiarity with data privacy and security best practices such as access control and data anonymization.
  • Strong execution and team management skills, with the ability to translate high-level objectives into actionable plans and drive team results.
  • Excellent communication and cross-functional collaboration skills, with the ability to clearly communicate technical and operational requirements, resolve conflicts, and manage stakeholder expectations.
  • Strong sense of ownership and resilience, with comfort operating in a fast-paced, evolving AI environment and the ability to navigate urgent delivery timelines.
 

Compensation & Benefits

The base salary range for this position is $175,000 - $250,000 USD annually.
Compensation may vary outside of this range depending on a number of factors, including a candidate’s qualifications, skills, competencies and experience. Base pay is one part of the Total Package that is provided to compensate and recognize employees for their work at Abaka AI. This role is eligible for equity, as well as a comprehensive benefits package (health, dental, vision, PTO, flexible work schedule).




