What are the responsibilities and job description for the AI Operation Support Engineer position at Ket Software?
Role: AI Operation Support Engineer
Location: Malvern, PA
Job Description:
In this role, you will:
Monitor AI system and data pipeline health, performance, and availability using established monitoring tools and dashboards. Detect, triage, and resolve incidents affecting AI systems and their data infrastructure, coordinating with technical teams as needed. Implement proactive measures to prevent recurring issues and minimize service disruptions.
Perform routine operational tasks to maintain AI systems and data pipelines, including model updates, data refreshes, pipeline maintenance, and system patches. Implement scheduled maintenance activities with minimal service disruption. Manage user access and permissions for AI platforms according to security policies.
Analyze AI system and data pipeline performance metrics, identify bottlenecks and inefficiencies, and implement optimizations to improve response times, data flow, accuracy, and resource utilization. Monitor for model drift and data quality issues, coordinating retraining or pipeline adjustments when necessary.
Create and maintain comprehensive operational documentation, including runbooks, standard operating procedures, and knowledge base articles. Document system configurations, data pipeline dependencies, and recovery procedures to ensure operational continuity.
Identify opportunities for process improvement and automation in AI operations. Develop and implement scripts and workflows to automate routine tasks, reducing manual effort and minimizing human error. Contribute to the evolution of MLOps practices based on operational experience and emerging best practices.
Contribute strategically to the evaluation of infrastructure cloud readiness.
Maintain comprehensive technical documentation.
Collaborate with engineering teams to develop and drive platform improvements.
Provide intermediate-level technical support for midsize to large projects.