Machine Learning Solutions Engineer (ML + Infrastructure Focus)

Lightning AI
New York, NY Full Time
POSTED ON 6/22/2026
AVAILABLE BEFORE 8/21/2026

Who We Are

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems—designed to take ideas from research to production with less friction.

Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.

We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

 

What We're Looking For

Lightning is looking for a Machine Learning Solutions Engineer with a focus on ML and Infrastructure to join our Sales team in New York. In this role, you will operate at the intersection of machine learning, distributed systems, and cloud infrastructure. You will partner with customers to design and deploy end-to-end AI systems, spanning:

  • Model development and training
  • GPU infrastructure and cluster design
  • Distributed inference and production deployment

This role goes beyond traditional ML solutions engineering—you will act as a technical architect, helping customers make critical decisions across compute, orchestration, and system design.

The role is hybrid out of one of our hub locations (New York City, San Francisco, Seattle) with an in-office requirement of at least 2 days per week and occasional team and company offsites. We are not able to provide visa sponsorship for this role at this time.

 

What You’ll Do

Customer Architecture & Technical Leadership

  • Partner with customers to understand ML workloads, infrastructure constraints, and scaling requirements
  • Architect end-to-end solutions across:
    • Data pipelines (CPU → GPU workflows)
    • Distributed training (multi-node, multi-GPU)
    • High-throughput inference systems
  • Translate business goals (latency, cost, throughput) into technical system design decisions

GPU & Infrastructure Design

  • Design and optimize workloads across GPU clusters (H100, H200, B200, etc.)
  • Advise on:
    • Training vs inference cluster design
    • Interconnect choices (Ethernet vs InfiniBand / RDMA vs RoCE)
    • Storage strategies (local NVMe vs networked / object storage)
  • Model and optimize for:
    • Tokens/sec, tokens/$
    • Throughput vs latency tradeoffs
    • GPU utilization and scheduling efficiency

Kubernetes & Platform Systems

  • Design and support deployments on Kubernetes (EKS, GKE, on-prem clusters)
  • Work with:
    • GPU scheduling (time-slicing, MIG, bin-packing)
    • Autoscaling and workload orchestration
    • Helm-based deployments and multi-tenant environments
  • Help customers balance:
    • Raw Kubernetes flexibility vs platform abstraction (Lightning)

Demos, POCs, and Execution

  • Build and deliver technical demos and POCs that showcase:
    • Distributed training workflows
    • Scalable inference endpoints
    • End-to-end ML pipelines on Lightning AI
  • Scope and lead POCs aligned to customer success metrics (latency, cost, reliability)

Cross-Functional Impact

  • Act as the bridge between customers, product, and engineering
  • Provide feedback on:
    • Platform gaps in infrastructure, orchestration, and performance
    • Emerging patterns in GPU usage and distributed systems
  • Influence roadmap across ML workflows and infrastructure capabilities

Enablement & Thought Leadership

  • Create technical content:
    • Architecture guides (e.g., high-throughput LLM inference systems)
    • Best practices for GPU utilization and scaling
  • Educate customers on modern AI infrastructure patterns

 

What You’ll Need

ML Systems Expertise

  • 3–6 years of experience in:
    • Machine Learning / AI Engineering
    • Solutions Engineering / Sales Engineering / ML Consulting
  • Strong understanding of:
    • Training vs inference workloads
    • Model optimization (quantization, batching, caching, etc.)

GPU & Distributed Systems

  • Experience working with:
    • GPU clusters (NVIDIA stack preferred)
    • Distributed training or inference systems
  • Familiarity with:
    • NCCL, CUDA, or GPU performance profiling
    • Networking concepts (RDMA, RoCE, InfiniBand, high-throughput systems)

Kubernetes & Cloud Platforms

  • Hands-on experience with:
    • Kubernetes (EKS, GKE, or on-prem)
    • Slurm 
    • Containerization (Docker)
  • Exposure to:
    • GPU scheduling in Kubernetes environments
    • Multi-tenant or production ML deployments

Programming & Tooling

  • Strong Python skills (PyTorch preferred)
  • Experience building:
    • ML pipelines
    • APIs or inference services
  • Familiarity with Lightning AI, PyTorch Lightning, or similar frameworks is a plus

Customer-Facing Excellence

  • Ability to:
    • Explain complex infrastructure and ML tradeoffs clearly
    • Run technical discovery and uncover quantifiable success metrics
  • Experience working cross-functionally with:
    • Sales, product, and engineering teams

 

Compensation

The annual base pay range for this role is $150,000 - $195,000, in addition to a variable pay component and meaningful equity. 

 

Benefits and Perks

We offer a comprehensive and competitive benefits package designed to support our employees’ health, well-being, and long-term success. Benefits may vary by location, team, and role.

Benefits include:

  • Comprehensive medical, dental and vision coverage (U.S.); Private medical and dental insurance (U.K.)
  • Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
  • Generous paid time off, plus holidays
  • Paid parental leave
  • Professional development support
  • Wellness and work-from-home stipends
  • Flexible work environment

 

At Lightning AI, we are committed to fostering an inclusive and diverse workplace. We believe that diverse teams drive innovation and create better products. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic. We are dedicated to building a culture where everyone can thrive and contribute to their fullest potential.





