Autonomous Driving Multimodal Model Algorithm Engineer
VLM / VLA / World Model
Black Sesame Technologies is building high-performance AI algorithms and self-developed chips for intelligent driving and beyond. As an Autonomous Driving Multimodal Model Algorithm Engineer, you will work on next-generation multimodal AI models for autonomous driving, including Vision-Language Models, Vision-Language-Action Models, and World Models.
You will collaborate with perception, prediction, planning, data, simulation, and deployment teams to integrate multimodal models with existing BEV perception, two-stage E2E, and one-stage E2E autonomous driving systems.
We are looking for candidates with hands-on experience in one or more of the following areas: Vision-Language Models, Vision-Language-Action Models, World Models.
Responsibilities
Multimodal Model Development for Autonomous Driving
- Work on one or more multimodal modeling directions for autonomous driving, including VLM-based scene understanding, VLA-style planning-oriented modeling, and World Model-based future prediction.
- Develop and optimize models that reason over multi-camera images, BEV features, map elements, object/lane instances, occupancy, trajectories, ego-motion, and driving context.
- Explore model architectures that connect perception, prediction, planning, and decision-making in two-stage and one-stage E2E autonomous driving systems.
- Collaborate with BEV perception and planning teams to improve representation quality, temporal consistency, long-tail robustness, and planning relevance.
Vision-Language and Vision-Language-Action Modeling
- Develop VLM-based methods for driving scene understanding, open-vocabulary perception, risk reasoning, corner-case analysis, and interpretable autonomy.
- Adapt and extend open-source multimodal architectures such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, or similar models for autonomous driving scenarios.
- Research VLA-style models that map multimodal driving context, navigation intent, and high-level instructions to trajectories, actions, or planning representations.
- Align visual, BEV, map, object, lane, occupancy, trajectory, and language representations for driving-specific tasks.
- Build supervised fine-tuning, instruction-tuning, and efficient adaptation pipelines for driving-relevant multimodal tasks.
World Model and Future Prediction
- Build world-model-based approaches for future BEV, occupancy, object motion, lane evolution, traffic interaction, and ego-conditioned scene rollout.
- Explore generative and predictive modeling methods such as diffusion models, autoregressive transformers, latent dynamics models, video prediction, and BEV prediction.
- Use learned world models for scenario generation, counterfactual reasoning, long-tail case mining, planning evaluation, and closed-loop analysis.
- Work with simulation and data teams to improve safety-critical scenario discovery and model-based evaluation.
Efficient Adaptation and Deployment
- Apply efficient fine-tuning and adaptation methods such as LoRA, QLoRA, adapters, prompt tuning, prefix tuning, or other PEFT techniques.
- Develop multimodal feature alignment modules, including projection heads, query adapters, cross-attention modules, tokenization strategies, and representation converters.
- Optimize model architecture, latency, memory footprint, and compute cost for automotive deployment.
- Apply distillation, quantization, pruning, sparse computation, and efficient attention methods where appropriate.
- Collaborate with chip, compiler, runtime, and deployment teams to adapt multimodal models to in-house automotive AI hardware.
Research, Evaluation, and Iteration
- Track the latest research in VLM, VLA, World Models, BEV perception, E2E driving, robotics foundation models, generative simulation, and multimodal learning.
- Design evaluation metrics for reasoning quality, grounding accuracy, temporal consistency, prediction quality, planning relevance, and safety-critical scenarios.
- Perform systematic failure analysis and drive data/model iteration based on real-world autonomous driving cases.
- Contribute to patents, technical reports, internal research platforms, and conference or journal publications.
Qualifications
- MS or PhD in Computer Science, Electrical Engineering, Robotics, Artificial Intelligence, or a related field.
- Strong background in deep learning, computer vision, multimodal learning, robotics, or autonomous driving.
- Hands-on experience in one or more of the following areas:
  - Vision-Language Models, multimodal large models, or open-source VLM adaptation
  - Vision-Language-Action models, robotics foundation models, or action-conditioned modeling
  - World models, generative prediction, latent dynamics modeling, or future scene simulation
  - BEV perception, multi-view 3D perception, or end-to-end autonomous driving
  - Motion prediction, planning, trajectory generation, or closed-loop evaluation
- Practical experience with open-source multimodal architectures such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, BLIP-style models, Flamingo-style models, or similar systems.
- Solid understanding of multimodal feature alignment, including vision-language alignment, cross-modal attention, visual tokenization, projection layers, query-based fusion, or embedding-space alignment.
- Experience with efficient fine-tuning or adaptation methods, such as LoRA, QLoRA, adapters, prompt tuning, prefix tuning, supervised fine-tuning, or instruction tuning.
- Proficient in PyTorch and capable of modifying, training, debugging, and evaluating deep learning models.
- Familiar with transformer architectures, attention mechanisms, temporal modeling, and large-scale training.
- Experience with multimodal data, such as camera, radar, LiDAR, IMU, map, trajectory, language, or structured driving data.
- Strong engineering ability in Python; C++/CUDA/TensorRT experience is a plus.
- Comfortable with Git, Docker, Linux, distributed training, and collaborative development workflows.
- Strong communication skills and ability to work across perception, planning, data, simulation, and deployment teams.
Preferred Qualifications
- Experience adapting or fine-tuning VLM/VLA models such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, or similar architectures.
- Experience with Hugging Face Transformers, PEFT, DeepSpeed, FSDP, vLLM, SGLang, TensorRT-LLM, or similar training/inference frameworks.
- Experience building multimodal instruction datasets, driving-scene QA datasets, grounding datasets, scene-reasoning datasets, or planner-oriented supervision signals.
- Experience aligning multimodal model representations with BEV features, object queries, lane instances, occupancy grids, map vectors, trajectories, or planner inputs.
- Experience with autonomous driving architectures such as BEVFormer, DETR/DINO, MapTR/MapQR, occupancy networks, diffusion planners, trajectory transformers, or similar models.
- Experience with world models, generative models, video prediction, future BEV prediction, occupancy forecasting, learned simulation, or closed-loop evaluation.
- Experience with efficient adaptation of large models, including LoRA/QLoRA, distillation, quantization, pruning, sparse attention, or lightweight adapter design.
- Experience deploying deep learning models on automotive SoCs, ASICs, GPUs, or edge AI accelerators.
- Publications at CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, CoRL, ICRA, IROS, RSS, or related autonomous driving and robotics venues, or comparably strong project experience.
- Strong ability to convert research ideas into robust production systems.
- Experience with AI agent tools and basic harness engineering, including building evaluation scripts, task runners, automated workflows, tool-use pipelines, and reproducible testing environments for model or agent development.