AI Data Architect | Healthcare AI Platform
Genzeon Corporation — Healthcare Division
Exton, PA / Hybrid | 0–4 years | Full-time
AI-native product architect with data engineering experience needed for product build-out
The short version: We run a multi-model AI pipeline that processes 150K Medicare documents/year — faxed PDFs, EDI transactions, FHIR data, clinical notes. You’ll design and build the data architecture that ingests, stores, governs, and serves all of it to AI models and clinical reviewers. On-prem GPUs, hybrid cloud, HIPAA compliance. This is the real thing.
What you’ll do:
Design the end-to-end data architecture for a healthcare AI platform: ingestion, storage, processing, serving, governance
Build pipelines for heterogeneous healthcare data: faxed PDFs, X12 EDI (835/837/278), FHIR R4, HL7v2, CMS files, unstructured clinical notes
Architect the data lake/lakehouse layer (Apache Iceberg, MinIO, DuckDB, PostgreSQL/pgvector)
Design the embedding and vector storage layer that powers RAG: chunking, indexing, retrieval optimization
Build data lineage tracking from source document to AI decision
Implement HIPAA/HITRUST data governance: encryption, access controls, audit logging, PHI handling
Monitor data quality across the pipeline: schema drift, completeness, freshness, anomalies
Optimize for hybrid infrastructure: on-prem GPUs (RTX 5090, L40S), NAS, Azure GovCloud, Azure Commercial
What you need:
A data pipeline you’ve built that ran in production (we’ll ask about it)
SQL fluency and Python proficiency
Experience with at least one of: Spark, dbt, Airflow, Dagster, Prefect
Hands-on work with unstructured or semi-structured data — PDFs, images, OCR outputs, free text
Practical understanding of vector databases, embeddings, and how RAG systems consume data
Comfort with on-premises infrastructure, not just managed cloud services
Data quality and governance as instincts, not afterthoughts
Strong signals:
Healthcare data formats (X12 EDI, FHIR, HL7, CCD/C-CDA)
Apache Iceberg, Delta Lake, or modern table formats
MinIO / S3 / object storage architecture
pgvector, Pinecone, Weaviate, or similar vector stores
DuckDB or embedded analytical engines
HIPAA technical safeguards implementation
ML data pipelines — training data, feature stores, evaluation sets, feedback loops
We don’t require:
A data engineering bootcamp cert
Mastery of the entire “modern data stack”
Prior healthcare experience (but it helps)
A specific degree
To apply, submit:
1. Resume
2. Link to a data project you’ve built (GitHub, architecture diagram, write-up)
3. 200 words max: “Describe the messiest data problem you’ve encountered. How did you solve it?”