What are the responsibilities and job description for the AI/ML Cloud Engineer/ MLOps (Azure) position at 1 Point System?
AI/ML Cloud Engineer / MLOps :: Bloomfield, CT
Relocation: OK
Must have: This project mainly focuses on deploying AI/ML models and tracking their usage. The team is using a tool called LiteLLM to measure how the models are being used.
You don’t need deep expertise in model training, but you should have a basic understanding of how models are trained.
Key Skills Required:
- Strong experience in MLOps (especially for model deployment)
- Experience with LLM gateways (a plus, as the team is starting to use one)
What You Will Do:
- Design, deploy, and manage cloud infrastructure for AI/ML workloads on AWS and Azure
- Work on AI platforms like:
- Amazon SageMaker
- Azure Machine Learning
- Support model training and deployment environments
- Help data scientists and ML engineers by setting up and optimizing infrastructure for:
- Model training
- Model deployment (inference)
Key Responsibilities
Cloud Infrastructure Management
- Design, deploy, and manage cloud infrastructure supporting AI/ML workloads on AWS and Azure.
- Manage compute resources such as EC2, Azure Virtual Machines, GPU instances, and Kubernetes clusters.
- Provision and configure storage, networking, and security services for AI platforms.
- Ensure high availability, scalability, and reliability of AI environments.
AI Platform Support
- Deploy and maintain AI/ML services such as:
- Amazon SageMaker
- Azure Machine Learning
- AI model training and inference environments
- Support data scientists and ML engineers by providing optimized infrastructure for model training and deployment.
Automation & Infrastructure as Code
- Implement Infrastructure as Code (IaC) using tools such as:
- Terraform
- CloudFormation
- ARM templates / Bicep
- Automate environment provisioning, patching, and scaling.
Containerization & Orchestration
- Deploy and manage containerized AI workloads using:
- Docker
- Kubernetes
- Amazon EKS
- Azure Kubernetes Service (AKS)
Monitoring & Performance Optimization
- Monitor system health, performance, and resource utilization using tools like:
- CloudWatch
- Azure Monitor
- Datadog / Prometheus
- Optimize infrastructure for cost, performance, and GPU utilization.
Security & Compliance
- Implement cloud security best practices including:
- IAM / RBAC management
- Network security groups
- Encryption and secrets management
- Ensure compliance with organizational and regulatory standards.
CI/CD & DevOps Integration
- Integrate AI infrastructure with CI/CD pipelines.
- Support automated deployment of models and AI services.
Required Qualifications
- Bachelor’s degree in Computer Science, Information Systems, or a related field.
- 5 years of experience in infrastructure administration or cloud engineering.
- Strong hands-on experience with:
- AWS cloud services
- Microsoft Azure cloud services
- Experience supporting AI/ML infrastructure or data platforms.
- Proficiency with Linux administration and scripting (Python, Bash, PowerShell).
- Experience with Docker and Kubernetes.
Preferred Qualifications
- Experience with GPU infrastructure for AI workloads.
- Knowledge of ML pipelines and MLOps practices.
- Experience with data platforms (Snowflake, Databricks, or Spark).
- Familiarity with AI frameworks such as TensorFlow or PyTorch.
- Cloud certifications such as:
- AWS Certified Solutions Architect
- Azure Administrator or Azure AI Engineer
Key Skills
- Cloud Infrastructure (AWS, Azure)
- AI/ML Platform Support
- Kubernetes / Containers
- Infrastructure Automation
- Monitoring & Performance Tuning
- Security & Compliance
- DevOps & CI/CD