What are the responsibilities and job description for the MLOps Engineer position at Scale.jobs?
About The Role
The role bridges the gap between machine learning research and production engineering, focusing on the infrastructure and automation required to scale ML models across the enterprise. This position is responsible for building robust MLOps platforms that support the entire model lifecycle—from automated training and experimentation to high-availability deployment and real-time monitoring.
The team operates at the intersection of DevOps and Data Science, ensuring that models are not just accurate, but also observable, reproducible, and scalable. The role involves architecting CI/CD pipelines specifically for ML, managing feature stores, and optimizing the resource utilization of GPU-accelerated workloads in a cloud-native environment.
Key Responsibilities
- Architect and maintain end-to-end MLOps pipelines using tools such as Kubeflow, MLflow, or TFX to automate model training and deployment
- Design and implement robust CI/CD workflows for machine learning that include automated unit testing for code, data validation, and model performance gating
- Manage and scale containerized ML workloads on Kubernetes, optimizing for latency and throughput in production inference environments
- Build and maintain a centralized feature store to ensure consistency between offline training and online serving data
- Develop automated monitoring and alerting systems to detect data drift, concept drift, and model service degradation, using tools such as Prometheus and Grafana
- Collaborate with security and data governance teams to implement model versioning, lineage tracking, and access controls for sensitive datasets
- Optimize model inference performance through quantization, pruning, or hardware-specific acceleration (TensorRT, ONNX)
Qualifications
- 3–7 years of experience in DevOps, SRE, or Software Engineering, with at least 2 years focused specifically on ML infrastructure
- Advanced proficiency in Python and experience with containerization technologies like Docker and Kubernetes
- Hands-on experience with cloud infrastructure, specifically AWS (SageMaker, EKS) or GCP (Vertex AI, GKE)
- Strong understanding of the ML lifecycle: versioning data (DVC), tracking experiments (MLflow/Weights & Biases), and serving models (Seldon, BentoML, or TorchServe)
- Familiarity with Infrastructure as Code (IaC) tools such as Terraform or CloudFormation
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related quantitative field
- Bonus: Experience with LLM orchestration (LangChain/LlamaIndex) and GPU cluster management