What are the responsibilities and job description for the Lead Engineer/SRE, KMS - AdTech Leader position at Andiamo?
Lead Site Reliability Engineer
This position offers the opportunity to guide the reliability and performance of large-scale, customer-facing systems. You will help create the services, automation, and architectural patterns that allow engineering teams to move quickly with confidence. The work focuses on treating operations as a software problem, building systems that are resilient by design, and partnering with product teams to ensure they can deliver reliable features at speed.
About The Role
You will take ownership of key reliability initiatives, shaping the technical vision for the systems under your care. Your work will support the continuous evolution of backend services and development workflows, helping teams release and operate their software smoothly. This role is ideal for someone who enjoys complex distributed systems, performance engineering, and building tools that empower large engineering groups.
How You Will Make a Difference
- Deliver foundational services that support rapid and predictable software delivery across the engineering organization.
- Create systems and operational processes that support reliable and scalable applications.
- Identify upstream solutions that prevent recurring issues and promote long-term stability.
- Develop the technical roadmap for your area, collaborating with stakeholders to solve meaningful engineering challenges.
- Improve throughput and system performance by analyzing and eliminating architectural bottlenecks.
- Work with tools and technologies such as Python, AWS, Django, Kubernetes, Bash, Terraform, MySQL, Redis, and Postgres.
- Help foster a culture of strong engineering practices through thoughtful design discussions and collaborative whiteboarding sessions.
- Support and mentor engineers across the company, helping raise the standard of engineering quality and operational excellence.
- Write and maintain software that improves the reliability, performance, and efficiency of platform services.
- Participate in on-call rotations with a focus on resolving issues at the source and reducing alert fatigue.
- Introduce architectural changes that significantly improve the scalability and resilience of critical systems.
- Work closely with product-oriented engineers and other SREs to deliver improvements that have real customer impact.
- Use data-driven analysis to understand system behavior, predict scaling needs, and guide strategic improvements.
- Promote site reliability principles across the engineering organization.
Qualifications
- Ten or more years of experience in site reliability engineering, DevOps, or related fields.
- Degree in computer science or a related field, or equivalent hands-on experience.
- Calm and focused during outages, with the ability to drive investigations to a clear root cause and long-term corrective measures.
- Strong understanding of Linux systems and the full networking stack.
- Experience collaborating with engineering teams to build and operate production software.
- Proficiency writing code using best practices in languages such as Python, Ruby, or Go.
- Genuine interest in exploring emerging AI tools and responsibly experimenting with techniques that improve engineering workflows.
About Andiamo
Talent Partners for the AI Revolution. As a globally recognized staffing and consulting firm, we specialize in placing the top 2% of technology and go-to-market professionals with the world’s largest and most well-known companies.
For over 20 years, we've maintained the status of tier-one vendor for firms such as Palantir, Amazon, Fluidstack, Bloomberg, Relativity Space, Firefly, MasterCard, Visa, Two Sigma, Citadel, as well as other major financial services firms, elite hedge funds, Google-backed tech start-ups, and major software firms.
Our talent solutions include Permanent Placement, Contract Staffing, Executive Search, and Dedicated Recruiting Services (RPO). Find out more at www.andiamogo.com.