Sr Director of Software Engineering- AI Infrastructure Platform

hackajob
Palo Alto, CA Full Time
POSTED ON 5/16/2026
AVAILABLE BEFORE 6/14/2026
hackajob is collaborating with J.P. Morgan to connect them with exceptional professionals for this role.

Job Description

Your opportunity to make a real impact and shape the future of financial services is waiting for you. Let's push the boundaries of what's possible together.

As a Senior Director of Software Engineering at JPMorganChase within the firmwide AI Infrastructure Platform organization, you will lead multiple technical areas and manage the activities of multiple departments responsible for delivering a unified AI infrastructure layer across on-premises environments, public cloud, and emerging accelerated-compute vendors. You will collaborate across AI/ML engineering, infrastructure, security and controls, and vendor teams to ensure the firm remains at the forefront of AI platform capabilities, operational excellence, and industry best practices.

In this role, you will own training and experimentation on a Kubernetes-standardized platform. While a dedicated architecture function exists, you will act as an active design partner, guiding architectural trade-offs and ensuring designs translate into reliable, secure, and operable systems at enterprise scale.

Job Responsibilities

  • Lead multiple technology and platform implementations across departments to deliver firmwide AI infrastructure objectives, with a primary focus on training and experimentation platforms operating at enterprise scale.
  • Own the design, delivery, and evolution of a Kubernetes-first training and experimentation platform, including Kubernetes-native support for batch and distributed training jobs, lifecycle management, retry semantics, and failure recovery patterns.
  • Standardize AI developer workflows for experimentation, enabling self-service job submission, reusable templates and golden paths, reproducibility mechanisms, and consistent runtime behavior across hybrid deployment environments.
  • Build and evolve platform APIs and automation, including Kubernetes controllers and operators where appropriate, to ensure the platform is safe, scalable, and easy to adopt across teams.
  • Drive measurable improvements in GPU availability and utilization through reliability engineering, fleet readiness patterns, and accelerated capacity onboarding.
  • Define and implement governance-based scheduling and placement strategies, including:
      • Multi-tenant GPU quotas and guardrails
      • Priority, admission control, and reservation patterns
      • Preemption policies
      • Fragmentation reduction and topology-aware placement (GPU type, MIG, and topology awareness)

  • Embed enterprise-grade security, risk, and control requirements into platform defaults, including IAM and RBAC controls, secrets management, audit logging, policy enforcement, network segmentation, and controlled change management.
  • Drive operational excellence by establishing SLIs and SLOs, managing error budgets, leading incident management practices, forecasting capacity, and delivering end-to-end platform observability across job lifecycles and GPU telemetry.
  • Act as the primary interface with senior leaders, stakeholders, and executives, driving alignment and consensus across competing priorities and complex initiatives.
  • Lead multiple engineering teams and managers, building a high-performing organization with strong engineering standards, scalable operating models, and a culture of accountability and continuous improvement.
  • Champion the firm's culture of diversity, opportunity, inclusion, and respect.

Required Qualifications, Capabilities, And Skills

  • 15 years of engineering experience, including 8 years of senior engineering leadership experience with responsibility for managing managers.
  • Demonstrated experience delivering platform products (beyond foundational infrastructure) with strong adoption, reliability, and operational maturity.
  • Experience developing and leading large, cross-functional engineering teams within highly matrixed and complex enterprise environments.
  • Proven track record of leading complex initiatives supporting distributed system design, testing, and operational stability at scale.
  • Deep hands-on expertise with Kubernetes-based platforms, including:
      • Multi-tenancy, RBAC, admission control, and network policy
      • Multi-cluster operations, upgrades, and cluster lifecycle management
      • Controllers, operators (CRDs), and platform API design patterns

  • Experience supporting AI training and experimentation platforms, including:
      • PyTorch and distributed training concepts such as scaling, orchestration, and failure modes
      • Ray or similar frameworks for distributed experimentation execution
      • Familiarity with Slurm or equivalent HPC or batch schedulers and core concepts such as queues, fair-share, reservations, and preemption

  • Understanding of modern AI inference stacks (for example, vLLM) and how serving constraints (latency, throughput, batching, KV cache behavior, and GPU memory limits) influence training and experimentation platform design.
  • Strong understanding of GPU infrastructure fundamentals, including NVIDIA ecosystem capabilities, health and telemetry signals, and scheduling and placement constraints.
  • Extensive practical experience with cloud-native technologies and hybrid infrastructure environments spanning on-premises and public cloud.
  • Experience hiring, developing, coaching, and retaining high-performing engineering talent.

Preferred Qualifications, Capabilities, And Skills

  • Experience operating large-scale GPU fleets, including heterogeneous accelerator environments.
  • Experience delivering hybrid AI platforms across on-premises infrastructure, public cloud, and specialized accelerated-compute vendors.
  • Experience working at the code level within large-scale distributed systems.
  • This position is subject to Section 19 of the Federal Deposit Insurance Act. As such, an employment offer for this position is contingent on JPMorganChase's review of criminal conviction history, including pretrial diversions or program entries.

ABOUT US

JPMorganChase, one of the oldest financial institutions, offers innovative financial solutions to millions of consumers, small businesses and many of the world's most prominent corporate, institutional and government clients under the J.P. Morgan and Chase brands. Our history spans over 200 years and today we are a leader in investment banking, consumer and small business banking, commercial banking, financial transaction processing and asset management.

We offer a competitive total rewards package including base salary determined based on the role, experience, skill set and location. Those in eligible roles may receive commission-based pay and/or discretionary incentive compensation, paid in the form of cash and/or forfeitable equity, awarded in recognition of individual achievements and contributions. We also offer a range of benefits and programs to meet employee needs, based on eligibility. These benefits include comprehensive health care coverage, on-site health and wellness centers, a retirement savings plan, backup childcare, tuition reimbursement, mental health support, financial coaching and more. Additional details about total compensation and benefits will be provided during the hiring process.

We recognize that our people are our strength and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. We also make reasonable accommodations for applicants' and employees' religious practices and beliefs, as well as mental health or physical disability needs. Visit our FAQs for more information about requesting an accommodation.

JPMorgan Chase & Co. is an Equal Opportunity Employer, including Disability/Veterans

About The Team

Our Global Technology Infrastructure group is a team of innovators who love technology as much as you do. Together, you'll use a disciplined, innovative, and business-focused approach to develop a wide variety of high-quality products and solutions. You'll work in a stable, resilient and secure operating environment where you, and the products you deliver, will thrive.

High Risk Roles (HRR) are sensitive roles within the technology organization that require high assurance of the integrity of staff by virtue of 1) sensitive cybersecurity and technology functions they perform within systems or 2) information they receive regarding sensitive cybersecurity or technology matters. Users in these roles are subject to enhanced pre-hire screening which includes both criminal and credit background checks (as allowed by law). The enhanced screening will need to be successfully completed prior to commencing employment or assignment.

Salary: $232,750 - $325,000
