What are the responsibilities and job description for the AI Platform Specialists position at Adam Information Technologies LLC?
Job Details
AI Service Hosting/AI Platform Specialists
Hybrid
Atlanta, GA
6-12 Months
AI Platform Specialists
We are building a new team of platform specialists to support and enhance high-performance AI services. These are highly technical, hands-on roles focused on customer, application, and platform support for AI workloads.
AI Platform Specialists will provide application and GPU support. The team will deliver Tier 1 and Tier 2 support to developers and engineers while collaborating closely with Tier 3 and Tier 4 platform teams and vendors on issue resolution. The roles require user-level knowledge of Kubernetes, virtualization, and cloud-native technologies, as well as operator-level knowledge of GPUs and other AI-supporting services. Each specialist should focus on customer service while pursuing the goals of reliability, scalability, and performance.
Key Responsibilities
• Platform Support & Incident Response
o Provide Tier 1 & Tier 2 support for AI-driven applications and workloads.
o Troubleshoot and resolve issues related to Kubernetes deployments, GPU utilization, and service performance (a triage sketch follows this list).
o Collaborate with Tier 3 teams, including Kubernetes engineers and external vendors, to escalate and resolve complex issues.
• Kubernetes & Cloud-Native Operations
o Adopt, build, and integrate automated services using tools such as Helm, Ansible, and Terraform.
o Deploy, manage, and support containerized AI workloads on Google Anthos-powered Kubernetes clusters.
o Enforce pod security policies, automated rollouts/rollbacks, and best practices for scalable and secure Kubernetes environments.
• GPU Infrastructure & AI Services Management
o Optimize and support GPU-enabled workloads built on CUDA and other AI acceleration frameworks.
o Assist in the installation, configuration, and support of AI coding assistants (e.g., Codeium).
• Observability & Documentation
o Maintain detailed operational documentation, runbooks, and troubleshooting guides.
o Utilize monitoring/logging tools such as New Relic, BigPanda, Prometheus, Grafana, and other observability frameworks.
• Process Improvement & Collaboration
o Work cross-functionally with developers, IT teams, and vendors to ensure seamless deployment and support of AI services.
o Contribute to CI/CD pipelines, automation, service, and security best practices.
o Track and communicate work through task management platforms (ServiceNow and Jira).
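To illustrate the day-to-day Tier 1/Tier 2 triage described above, here is a minimal sketch using the official Kubernetes Python client to flag degraded deployments and restarting pods. It is illustrative only, not the team's actual tooling; the "ai-workloads" namespace and the output format are assumptions.

```python
# Minimal deployment/pod triage sketch using the official Kubernetes
# Python client (pip install kubernetes). Assumes a valid kubeconfig;
# the "ai-workloads" namespace is a hypothetical example.
from kubernetes import client, config

def triage_namespace(namespace: str = "ai-workloads") -> None:
    """Print replica status for each deployment and any container restarts."""
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    apps = client.AppsV1Api()
    core = client.CoreV1Api()

    # Flag deployments whose ready replica count lags the desired count.
    for dep in apps.list_namespaced_deployment(namespace).items:
        ready = dep.status.ready_replicas or 0
        desired = dep.spec.replicas or 0
        flag = "OK" if ready == desired else "DEGRADED"
        print(f"[{flag}] {dep.metadata.name}: {ready}/{desired} replicas ready")

    # Surface pods with container restarts, a common first clue in triage.
    for pod in core.list_namespaced_pod(namespace).items:
        for cs in pod.status.container_statuses or []:
            if cs.restart_count > 0:
                print(f"  pod {pod.metadata.name}, container {cs.name}: "
                      f"{cs.restart_count} restarts")

if __name__ == "__main__":
    triage_namespace()
```

The same checks map onto everyday kubectl describe/rollout commands; the scripted form is simply convenient for documenting repeatable runbook steps.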
Required Skills & Experience
• Hybrid Cloud – In-depth knowledge of private (on-premises) and public (Google Cloud Platform & AWS) cloud architectures and services.
• AI/ML Software – Developer experience with DevOps practices and tooling (Git, Jenkins, etc.), plus experience working alongside AI/ML engineers and data scientists.
• AI/ML Hardware – Experience deploying, supporting, and optimizing GPU-enabled infrastructure (NVIDIA & AMD), both on-premises and in the cloud, across VMs and containers (a GPU health-check sketch follows this list).
• Kubernetes Expertise – Hands-on experience with deploying and managing containerized workloads in Kubernetes.
• Technical Support & Troubleshooting – Proven ability to diagnose and resolve customer and platform issues in production environments.
• Strong Communication & Documentation – Ability to clearly document procedures, write knowledge base articles, and collaborate with customers and teams.
• Time Management & Accountability – Ability to work independently, prioritize tasks, and manage workload effectively.
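As a taste of the GPU infrastructure support listed above, here is a rough health-check sketch using the pynvml bindings (the nvidia-ml-py package). It assumes an NVIDIA driver and NVML are present on the host; the output format and thresholds of interest are illustrative only.

```python
# Rough NVIDIA GPU health report via pynvml (pip install nvidia-ml-py).
# Prints per-device utilization and memory usage; assumes NVML is available.
import pynvml

def gpu_health_report() -> None:
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"GPU {i} ({name}): {util.gpu}% utilization, "
                  f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB memory")
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    gpu_health_report()
```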
Preferred Qualifications
• Experience with GPU orchestration tools like Run:ai, NVIDIA AI Enterprise, VMware Private AI Foundation, etc.
• Exposure to AI coding assistants like Codeium, Copilot, or Tabnine.
• Proficiency with Python and ML development tools such as PyTorch, TensorFlow, and Jupyter Notebooks.
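On the framework side, a quick sanity check like the one below is the kind of thing a specialist might run, or hand to a developer, to confirm that PyTorch can actually see the GPUs a cluster advertises. It is a generic sketch under that assumption, not a prescribed procedure.

```python
# Framework-level GPU visibility check with PyTorch (pip install torch).
# Complements driver-level tools such as nvidia-smi/NVML by confirming the
# framework was built with CUDA support and can enumerate devices.
import torch

def torch_gpu_summary() -> None:
    print(f"PyTorch {torch.__version__}, CUDA build: {torch.version.cuda}")
    if not torch.cuda.is_available():
        print("No CUDA devices visible to PyTorch.")
        return
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"cuda:{i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")

if __name__ == "__main__":
    torch_gpu_summary()
```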