AI/ML Systems Engineer - MoE Mapping

Etched
San Jose, CA Full Time
POSTED ON 4/28/2025
AVAILABLE BEFORE 6/28/2025

About Etched

Etched is building AI chips that are hard-coded for individual model architectures. Our first product (Sohu) only supports transformers, but has an order of magnitude more throughput and lower latency than a B200. With Etched ASICs, you can build products that would be impossible with GPUs, like real-time video generation models and extremely deep & parallel chain-of-thought reasoning agents.

Job Summary

We are seeking a highly skilled and motivated ML Software Architect to join the Etched Inference Serving Stack team. You will be responsible for the deployment and optimization of large-scale AI models, particularly Mixture of Experts (MoE) architectures. This role focuses on developing strategies, software, and infrastructure for efficiently mapping and executing MoE models, including next-generation large language models, across distributed compute environments. You will play a pivotal role in ensuring the performance, scalability, and reliability of our largest AI workloads, collaborating closely with AI researchers and infrastructure teams.

Key responsibilities

  • MoE Model Mapping and Partitioning: Design, develop, and implement algorithms and software for partitioning and mapping MoE models (including experts and gating networks) across multiple Sohu accelerator hosts.

  • Distributed Systems Performance Optimization: Analyze and optimize the performance (latency, throughput, resource utilization) of distributed MoE model inference, focusing on computation, memory bandwidth, and network communication efficiency.

  • Distributed ML Deployment Frameworks: Design, build, and maintain software tools and frameworks automating deployment, scaling, and management of distributed MoE models.

  • Orchestration and Integration: Integrate MoE deployment strategies with cluster management and orchestration systems (e.g., Kubernetes, Slurm) and ML platforms (e.g., Kubeflow, Ray).

  • System Validation and Correctness: Develop and execute comprehensive test plans validating functionality, performance, scalability, and numerical correctness of distributed MoE model deployments.

  • Collaboration and Troubleshooting: Collaborate closely with ML engineers and infrastructure/hardware teams to understand the inference stack and hardware capabilities/constraints, and diagnose and resolve complex distributed systems issues affecting large model deployment.

Representative projects

  • Develop and implement a novel mapping strategy for a large MoE language model onto a heterogeneous cluster of GPUs, TPUs or similar accelerators.

  • Build performance analysis tools to identify bottlenecks in distributed MoE inference across hundreds of nodes.

  • Optimize network communication patterns (e.g., expert routing via NCCL/MPI) for a specific MoE architecture.

  • Integrate automated, optimized MoE model deployment into an MLOps CI/CD pipeline.

  • Debug and resolve performance degradation or correctness issues in a large-scale MoE deployment.

  • Evaluate and compare different model parallelism techniques (e.g., expert, tensor, pipeline parallelism) for upcoming MoE models.

You may be a good fit if you have

  • Proficiency in Python, C++, PyTorch, and JAX.

  • Strong understanding of large language model architectures, particularly Mixture of Experts (MoE).

  • Solid understanding of distributed systems concepts, algorithms, and challenges (e.g., consensus, consistency, communication patterns).

  • Experience in developing and optimizing software for distributed computing environments.

  • Strong understanding of operating systems (Linux preferred) and underlying hardware, including accelerator architectures (GPUs, TPUs) and high-speed interconnect technologies (e.g., NVLink, InfiniBand).

  • Experience analyzing performance traces and logs from distributed systems and ML workloads.

Strong candidates may also have experience with

  • Cluster orchestration technologies (Kubernetes, Slurm) and ML platforms (Ray, Kubeflow).

  • Model parallelism libraries or frameworks (e.g., DeepSpeed, Megatron-LM, FairScale, GSPMD, vLLM).

  • High-performance communication libraries (e.g., MPI, NCCL).

  • ML-specific profiling and debugging tools (e.g., NVIDIA Nsight Systems, PyTorch Profiler, TensorBoard).

Ideal background

  • Candidates with experience deploying large-scale LLMs in distributed environments.

  • Candidates with a deep understanding and hands-on experience designing, implementing, or optimizing MoE architectures.

  • Candidates who have developed software specifically for optimizing distributed computation (e.g., communication optimization, load balancing, custom kernels).

  • Candidates experienced in performance analysis, bottleneck identification, and optimization for distributed AI/HPC workloads.

  • Candidates familiar with mapping complex computational graphs onto parallel and distributed hardware.

  • Candidates with a background in High-Performance Computing (HPC) applied to Machine Learning problems.

Benefits

  • Full medical, dental, and vision packages, with 100% of premiums covered

  • Housing subsidy of $2,000/month for those living within walking distance of the office

  • Daily lunch and dinner in our office

  • Relocation support for those moving to West San Jose

How we’re different

Etched believes in the Bitter Lesson. We think most of the progress in the AI field has come from using more FLOPs to train and run models, and the best way to get more FLOPs is to build model-specific hardware. Larger and larger training runs encourage companies to consolidate around fewer model architectures, which creates a market for single-model ASICs.

We are a fully in-person team in West San Jose, and greatly value engineering skills. We do not have boundaries between engineering and research, and we expect all of our technical staff to contribute to both as needed.





