What are the responsibilities and job description for the Member of Technical Staff (MTS) - Multimodal Foundation Models position at Deeproute.ai?
Focus
Multimodal Foundation Models · Representation Learning · Method Innovation
We are looking for strong technical builders and researchers whose understanding of foundation models and representation learning goes beyond applying existing frameworks.
Ideal candidates should have:
- Strong experimental rigor
- Solid systems and modeling intuition
- Hands-on engineering ability
- Interest in scalable multimodal AI systems for real-world autonomy
We value people who can bridge research and production, and who care about robustness, scalability, efficiency, and practical deployment in large-scale autonomous driving systems.
Responsibilities
1. Large-Scale Foundation Model Pretraining
- Develop scalable pretraining pipelines for large-scale multimodal driving data
- Design and optimize training strategies for:
  - Vision-language-action models
  - Video foundation models
  - Long-context temporal modeling
  - Multimodal representation alignment
- Improve:
  - Training stability
  - Data efficiency
  - Scaling efficiency
  - Representation robustness
- Work on distributed training systems and large-scale model optimization using frameworks such as:
  - PyTorch Distributed
  - DeepSpeed
  - Megatron-LM
2. Representation Learning & Method Innovation
- Design and improve self-supervised and multimodal learning methods for real-world autonomous driving systems
- Conduct architecture-level research on:
  - Vision Transformers (ViT)
  - Video / temporal architectures
  - Multimodal fusion and alignment
  - Embedding and retrieval systems
  - Long-context and memory-efficient architectures
- Explore and improve:
  - Pretraining objectives
  - Loss functions
  - Training paradigms
  - Generalization and robustness
- Analyze model behavior through:
  - Rigorous ablation studies
  - Failure case analysis
  - Representation probing and evaluation
3. Efficient Foundation Models & Scalable Deployment
- Improve the efficiency, scalability, and deployability of large multimodal foundation models for real-world autonomous driving systems
- Work on areas such as:
  - Model quantization
  - Knowledge distillation
  - Efficient attention mechanisms
  - Sparse architectures and Mixture-of-Experts (MoE)
  - Long-context and memory-efficient modeling
  - Inference acceleration and serving optimization
  - Training and inference system efficiency
- Optimize model throughput, latency, memory usage, and deployment performance for large-scale production environments
Qualifications
- Background in fields such as:
  - Computer Vision
  - Machine Learning
  - Robotics
  - Computer Science
  - Related fields
- Experience with topics and methods such as:
  - Foundation models
  - Self-supervised learning
  - Representation learning
  - Multimodal learning
  - Large-scale pretraining
  - CLIP
  - DINO / DINOv2
  - MAE
  - Contrastive learning
  - Masked modeling
  - MoE or scalable transformer architectures
  - Video foundation models
  - Long-context modeling
  - Retrieval systems
  - Efficient inference
  - Distributed training
  - Model compression and deployment optimization
- Publications at venues such as:
  - CVPR
  - ICCV
  - ECCV
  - NeurIPS
  - ICLR
  - ICML