What are the responsibilities and job description for the Hybrid || LLM Inference & GPU Systems Consultant || Charlotte, NC position at Technogen, Inc.?

TECHNOGEN, Inc. has been a proven leader in providing full IT services, software development, and solutions for 15 years.

TECHNOGEN is a Small & Woman Owned Minority Business with GSA Advantage Certification. We have offices in VA and MD and offshore development centers in India. We have successfully executed 100 projects for clients ranging from small businesses and non-profits to Fortune 50 companies and federal, state, and local agencies.

Description:

Local candidates preferred.

Role Overview:
We are seeking an AI Infrastructure Runtime Engineer to build and maintain large-scale on-prem LLM infrastructure. This is an enterprise private GenAI environment running on NVIDIA H200 GPU clusters and an OpenShift AI deployment ecosystem. You will manage production inference internally, including self-hosting open-source LLMs like Llama. We are focused exclusively on inferencing; this role involves no model training infrastructure or fine-tuning pipelines.

Key Responsibilities
NVIDIA GPU Runtime Optimization: Maximize runtime efficiency of the token generation pipeline, with a specific focus on prefill/decode optimization and KV cache management.
Inference Serving: Deploy and manage inference engines including vLLM and TensorRT-LLM.
Hardware Utilization: Tune GPU throughput, batching strategies, and latency. Orchestrate GPU workloads using RunAI and Kubernetes.
Model Lifecycle Management: Oversee the complete Hugging Face model lifecycle, including model onboarding, deployment, and retirement.
Platform Operations: Operate and maintain the OpenShift AI ecosystem as the primary container platform for GenAI workloads.

Required Qualifications
8 years of experience working as an LLM Systems Engineer or AI Infrastructure Runtime Engineer.
8 years of hands-on experience with NVIDIA H200 clusters and runtime optimization techniques (KV cache, prefill/decode).
Proficiency in OpenShift AI and GPU orchestration tools like RunAI.
Strong experience with modern inference frameworks, specifically vLLM and TensorRT-LLM.
Proven track record managing the Hugging Face deployment lifecycle.
Must be onsite at the client in Charlotte, NC at least 3 days/week.
