What are the responsibilities and job description for the MLOps Platform Engineer position at Scale.jobs?
About The Role
The role owns the infrastructure that makes machine learning models reliable in production: deployment pipelines, experiment tracking, monitoring, and the Kubernetes and cloud infrastructure underneath.
This is a role where infrastructure and data science intersect. You will work closely with both applied scientists and platform engineers, and you will be judged on the reliability and developer experience of the systems you build.
Seattle, WA (Hybrid)
Key Responsibilities
- Build and maintain end-to-end ML pipelines using Kubeflow, Airflow, or Prefect, covering data ingestion, training, validation, and deployment
- Manage Kubernetes clusters hosting model training and serving workloads; optimize resource utilization, autoscaling, and reliability
- Operate and extend MLflow for experiment tracking, model registry, and artifact management across multiple teams
- Design and implement feature stores (Feast or equivalent) to ensure consistent feature access across training and real-time serving
- Build CI/CD pipelines for automated model testing, validation gates, and progressive deployment strategies
- Develop monitoring dashboards for model performance, data drift, and infrastructure health; configure alerting and incident response workflows
- Write infrastructure as code (IaC) using Terraform; maintain documentation of platform architecture and runbooks for production operations
Requirements
- 2–6 years of DevOps, platform engineering, or MLOps experience, with specific exposure to ML workloads in production
- Strong Kubernetes knowledge: cluster management, operators, resource quotas, node affinity, persistent volumes
- Hands-on Terraform experience for cloud infrastructure provisioning on AWS, GCP, or Azure
- Familiarity with ML workflow orchestration and tracking tools: MLflow, Kubeflow Pipelines, Airflow, or Prefect
- Python proficiency for pipeline scripting and automation; comfort with Docker and container registries
- Understanding of cloud ML services: SageMaker, Vertex AI, or Azure ML
- Bonus: experience with Spark or Databricks, Prometheus/Grafana stack, service mesh (Istio), or Seldon/KServe for model serving
Seattle, WA (Hybrid)
- San Francisco Bay Area
- New York City
- Austin, TX