What are the responsibilities and job description for the AI Engineer position at Oliver James?
Location: Remote (U.S.)
Employment Type: Contract through End of Year (W2 Only)
Overview
We are working with an asset management firm seeking an experienced AI Engineer to support the operationalization, reliability, and monitoring of AI platforms and agent-based services within a large enterprise technology environment.
This role will focus on helping AI solutions move from development into production by building scalable operational processes, observability frameworks, automation capabilities, and support models that ensure reliability, performance, and long-term maintainability.
The ideal candidate will have a blend of AI platform experience, cloud engineering expertise, Site Reliability Engineering (SRE) principles, and a passion for building highly available production systems.
Responsibilities
AI Platform Reliability & Operations
- Support the deployment and operationalization of AI platforms, AI agents, and generative AI services in production environments.
- Establish reliability standards focused on availability, scalability, resiliency, and performance.
- Develop operational runbooks, support procedures, and escalation processes for AI-driven platforms.
- Ensure production readiness for AI services and supporting infrastructure.
- Design and implement monitoring, alerting, and observability solutions across AI platforms and supporting infrastructure.
- Establish telemetry and logging capabilities across model inference, orchestration, compute, and data pipeline layers.
- Integrate AI platform monitoring into enterprise incident management and operational support frameworks.
- Perform root cause analysis and drive remediation efforts for platform incidents and recurring issues.
- Apply Site Reliability Engineering (SRE) best practices, including SLIs, SLOs, error budgets, and automation-first principles.
- Automate operational processes related to deployment, validation, monitoring, and recovery.
- Build tooling and operational improvements that enhance platform stability and scalability.
- Partner closely with Cloud Engineering, Enterprise Architecture, Data Engineering, Security, and Application Development teams.
- Ensure AI solutions align with enterprise security, compliance, governance, and risk management standards.
- Serve as a subject matter expert for AI platform operations and production support.
Qualifications
- Experience in Site Reliability Engineering (SRE), Platform Engineering, Production Engineering, or Infrastructure Operations environments.
- Hands-on experience supporting AI/ML platforms, generative AI solutions, AI agents, or data-intensive platforms in production.
- Strong understanding of monitoring, observability, incident management, and reliability engineering principles.
- Experience operating cloud environments such as Amazon Web Services, Microsoft Azure, or Google Cloud.
- Experience supporting containerized environments such as Kubernetes.
- Strong troubleshooting and root cause analysis experience within distributed systems.
- Experience supporting AI model inference platforms, AI pipelines, and agent-based architectures.
- Knowledge of MLOps practices and AI lifecycle management.
- Experience with enterprise monitoring, alerting, and logging platforms.
- Exposure to messaging systems, event-driven architectures, and data platforms.
- Experience operating within highly regulated enterprise environments.
- Hands-on experience with Microsoft Copilot, Copilot Studio, Copilot integrations, or supporting enterprise AI assistant platforms in production environments.
- Familiarity with enterprise AI governance, monitoring, and support practices related to generative AI solutions.
- Strong communicator who can work effectively across technical and business teams.
- Comfortable operating in ambiguous, fast-moving AI environments.
- Passionate about AI reliability, operational excellence, and automation.
- Able to balance strategic platform thinking with hands-on engineering and support responsibilities.