What are the responsibilities and job description for the Site Reliability Engineer/ Platform Engineer position at Agile Fuel | World-class Dedicated Engineering Teams?
Our client is a fast-growing AI-driven technology company focused on building intelligent, automated solutions that transform how modern engineering teams work. They are committed to creating a development culture where speed, reliability, and data-driven decision-making are at the core. Their product leverages advanced analytics and AI to help organizations improve productivity, enhance visibility, and deliver software more efficiently.
They are seeking a hybrid Site Reliability Engineer / Platform Engineer with strong DevOps expertise and solid Python engineering skills. This person will design, build, and operate the next generation of their cloud infrastructure and internal developer platforms. The ideal candidate is passionate about automation, observability, reliability, and scalable system design. You will drive improvements across cloud architecture, CI/CD workflows, development tooling, and operational excellence — enabling the engineering organization to ship faster and more reliably.
If you thrive in a fast-moving, AI-native environment and enjoy building intelligent, highly automated platforms, this role is an excellent fit.
Responsibilities
- Design, build, and maintain highly reliable, scalable Azure infrastructure using Container Apps, ACR, managed databases, serverless components, and other PaaS services;
- Own and enhance CI/CD pipelines, deployment workflows, platform automation, and the full observability stack;
- Develop Python-based tooling and infrastructure to support a scalable, reliable AI-driven platform;
- Architect and maintain secure, fault-tolerant integrations with external systems (GitHub, Jira, Azure, Redis, Sentry, etc.);
- Build and operate monitoring, logging, alerting, and SLO/SLA frameworks to ensure reliability and performance;
- Partner with backend and data engineering teams to design a scalable infrastructure foundation for high-growth AI products;
- Continuously optimize cost efficiency, reliability, and deployment velocity;
- Scale AI infrastructure and support the transition to an AI-native engineering organization;
- Drive an AI-native culture by leveraging LLM-powered workflows and automation for speed and efficiency.
Requirements
- 5 years in DevOps, SRE, Platform Engineering, or similar roles;
- Expert-level understanding of cloud infrastructure, ideally Azure, including container services, serverless patterns, networking, and identity;
- Strong Python software engineering ability — building platform tools, automation frameworks, or backend services;
- Hands-on experience with containerization, Docker, and cloud-native operational patterns;
- Strong understanding of external system integrations, how to design around them, and how to build reliable abstractions when they fail;
- Experience designing and operating production-grade pipelines, monitoring, alerting, and observability tools;
- Practical understanding of resilience engineering: retries, backoff, idempotency, state management, and failure modes;
- A bias toward automation: if something can be automated, you automate it;
- A startup mindset: ownership, speed, pragmatic decision-making, and willingness to wear multiple hats;
- Interest in and excitement about AI-native development workflows using tools like ChatGPT, GitHub Copilot, and automated pipeline orchestration;
- Upper-Intermediate English level.
Nice to have
- Experience with Bicep, Terraform, or other IaC tools;
- Background supporting Python/Django or data pipelines;
- Familiarity with Celery, distributed queues, or event-driven systems;
- Experience working in SOC2-compliant or enterprise-grade environments;
- Experience building internal developer platforms (IDPs) or self-service infrastructure.
What we offer
- People-oriented management without bureaucracy;
- Flexible schedule (≈ 3 hours overlap with ET);
- 15 working days of annual paid vacation;
- Paid sick leave;
- Friendly and engaging professional team;
- Opportunities for self-realization, career advancement, and professional growth.