What are the responsibilities and job description for the Data Scientist position at OYF (Own Your Future) Staffing?
Job Description: Senior Multimodal Foundation Model Engineer
Location: Seattle, WA
Compensation: $240K–$300K salary + 0.5%–1% equity
Work Policy: On-site, 5 days/week
Visa Support: Visa & Green Card sponsorship available
About the Role
This role offers a rare opportunity to shape truly foundational technology in a space where the boundaries are still being defined. You will be part of a small, high-performing team building the first real-time human foundation model capable of understanding and generating text, speech, facial expression, and body language as a unified system.
You’ll work on technology that interprets the micro-signals humans intuitively use — the quirk of an eyebrow, a pause in speech, a shift in tone — and builds models that can understand and respond with emotional intelligence.
Your work will power lifelike, responsive avatars whose expressions, gestures, and tone evolve naturally frame-by-frame to deliver deeply human interactions.
This is a role for someone who wants to build at the frontier of multimodal AI, push scientific boundaries, and work hands-on at massive scale.
What You’ll Do
- Design, train, and optimize large multimodal and autoregressive models that operate across text, speech, and visual signals in real time.
- Build systems that understand fine-grained human cues and infer nuanced intent and emotion.
- Develop lifelike avatar generation systems capable of natural facial expression, gesture, and tone rendering.
- Lead model training end-to-end, from data pipeline design to pre-training to evaluation and iteration.
- Collaborate closely with a world-class founding team to drive architectural decisions, establish research direction, and experiment rapidly.
- Work in a fast-paced, flat, highly collaborative environment where you will have significant ownership and influence.
Required Qualifications
Experience
- 3+ years of experience training multimodal LLMs, MLLMs, autoregressive architectures, or closely related models.
- Hands-on experience with large-scale pre-training and familiarity with full model training pipelines.
- Prior experience training models in industry or advanced research environments.
Education
- Degree in Computer Science, Mathematics, or Engineering from a top-tier institution.
- PhD (or PhD-level research experience) with a focus on speech synthesis, multimodal modeling, or related fields.
Technical Skills
- Deep understanding of large language models, especially multimodal systems combining text, audio, and visual data.
- Demonstrated ability to train models at large scale (e.g., distributed training across 32 GPUs).
- Strong understanding of model architecture, inference optimization, and large-scale data processing.
Soft Skills
- Low ego, collaborative, and easy to work with.
- Genuine interest in committing to a startup environment and building foundational technology.
- Strong communication and willingness to iterate quickly.
Why Join
- Exceptional founding team with deep expertise across AI, speech, embodied intelligence, and real-time modeling.
- Work on groundbreaking technology: building the first human foundation model that unifies real-time emotional and social intelligence across modalities.
- Clear impact: your work directly shapes the core product and technical direction.
- Flat, collaborative structure where top performers can influence decisions and experiment freely.
- Mission-driven environment focused on creating AI that interacts with people more naturally and meaningfully.
- Strong funding and an ambitious vision spanning AI companionship, enterprise workflows, interviewing, sales intelligence, and more.