What are the responsibilities and job description for the Cloud Native Kubernetes Lead position at IT Technical Jobs?
We are seeking a Site Reliability Engineer (SRE) to ensure the stability, scalability, and reliability of cloud-native production platforms. In this role, you will support Kubernetes-based platforms and cloud-native storage systems, working closely with platform and application teams to maintain high availability, performance, and operational excellence.
You will play a critical role in operating large-scale distributed systems, improving reliability through automation, and contributing to a culture of continuous improvement.
Key Responsibilities
Reliability & Operations
- Proactively identify, troubleshoot, and resolve system issues to meet defined Service Level Agreements (SLAs).
- Ensure the availability, reliability, and performance of cloud-native production environments.
- Participate in on-call and shift-based operations to provide 24/7 global support.
Collaboration & Incident Management
- Collaborate closely with platform and engineering teams to diagnose and resolve complex infrastructure and application issues.
- Act as a core member of the platform team, partnering with application teams to ensure seamless integration and uptime.
- Support incident response, root cause analysis, and post-incident reviews to prevent recurrence.
Platform Engineering & Optimization
- Operate and optimize cloud-native Kubernetes platforms and software-defined storage systems.
- Continuously improve platform scalability, resilience, and high availability.
- Support service mesh technologies to enhance observability, security, and traffic management.
Automation & Tooling
- Develop and maintain runbooks to standardize operational processes and improve response efficiency.
- Build tools and automation using shell scripting and Python to improve operational efficiency and empower L2 support engineers.
- Drive automation to reduce manual intervention and improve system reliability.
Global Team Collaboration
- Work within a globally distributed team across multiple regions and time zones.
- Collaborate effectively in a shift-based operational model to ensure seamless handovers and continuous service coverage.
Required Skills & Experience
Core Technical Skills
- Strong Linux system administration experience with excellent troubleshooting skills.
- Hands-on experience operating Kubernetes clusters in production environments.
- Strong understanding of cloud infrastructure fundamentals, including compute, storage, networking, and automation.
- Experience with software-defined storage (SDS) solutions.
- Experience with virtualization technologies.
Cloud-Native & Networking
- Experience with service mesh technologies such as Istio and Envoy.
- Familiarity with observability concepts (metrics, logs, tracing) and platform reliability best practices.
Automation & Scripting
- Demonstrable scripting and automation skills using Bash, Python, or similar languages.
- Experience building tools to streamline operational workflows.
Certifications (Preferred)
- Certified Kubernetes Administrator (CKA)
- Certified Kubernetes Security Specialist (CKS/CNS)
Professional Attributes
- Self-motivated and able to learn quickly in a fast-paced environment.
- Strong sense of ownership with the ability to deliver results with minimal supervision.
- Passion for identifying and adopting new cloud technologies, tools, and processes to drive innovation and business value.
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
Ways of Working (Our Principles)
We value engineers who:
- Continuously improve systems and processes through automation and learning
- Take a professional, accountable approach to reliability and operations
- Apply hypothesis-driven problem solving to complex technical challenges
- Prioritize customer and user experience through reliable services
- Act with urgency and ownership, balancing speed with operational excellence