IOC Systems Specialist
Onsite | M-F, 8-hour shifts (rotating on-call)
Optomi, in partnership with a leading AI cloud infrastructure organization, is seeking an IOC Systems Specialist to join their growing operations team in Fort Worth, TX. This role will provide Tier 2 operational support for high-performance computing (HPC) cloud environments focused on large-scale AI training and inference workloads. The ideal candidate will have hands-on experience supporting HPC infrastructure, Kubernetes environments, Slurm workload management, and enterprise storage platforms such as WEKA and VAST. This individual will play a key role in maintaining system stability, troubleshooting complex incidents, and supporting mission-critical infrastructure within a 24x7 IOC/NOC environment.
What the Right Candidate Will Enjoy:
- Working with cutting-edge AI and HPC infrastructure technologies!
- Supporting large-scale GPU cluster environments!
- Exposure to advanced Kubernetes, cloud, and storage technologies!
- Opportunities to contribute to operational improvements and automation initiatives!
- Joining a fast-growing organization focused on sustainable, renewable-powered AI infrastructure!
- Collaborative environment with strong technical leadership and growth opportunities!
What Type of Experience the Right Candidate Has:
- 2–5 years of experience supporting or operating HPC clusters in production environments
- Strong operational experience with WEKA and VAST storage platforms
- Hands-on experience with Kubernetes administration and troubleshooting
- Experience supporting Slurm workload manager environments
- Familiarity with HPC monitoring, observability, and alerting platforms
- Experience performing incident response and root cause analysis in complex systems
- Understanding of cloud platforms such as AWS, Azure, or GCP
- Knowledge of HPC networking and storage technologies, including InfiniBand and high-throughput interconnects
Responsibilities of the Right Candidate:
- Provide Tier 2 operational support for HPC cloud infrastructure environments
- Monitor, troubleshoot, and resolve incidents involving Kubernetes, Slurm, storage, networking, and cloud systems
- Serve as an escalation point for Tier 1 support teams
- Perform root cause analysis and coordinate with engineering teams on permanent resolutions
- Execute operational changes, upgrades, patching, and maintenance activities
- Maintain and improve operational documentation, runbooks, and knowledge base articles
- Support monitoring and observability tooling to proactively identify system issues
- Assist with operational readiness and production support for new HPC capabilities
- Mentor junior operations staff and support continuous service improvement initiatives
- Participate in on-call rotations and major incident response activities
Job Must Haves:
- Must have hands-on experience with WEKA and VAST storage environments
- 2–5 years supporting HPC clusters in production or IOC/NOC environments
- Working knowledge of Kubernetes
- Operational experience with Slurm workload manager
- Familiarity with HPC monitoring and observability tooling
- Experience with incident response and root cause analysis
- Understanding of AWS, Azure, or GCP cloud platforms
- Knowledge of HPC networking and storage infrastructure
- Ability to work onsite in Fort Worth on a rotating 12-hour shift schedule
Nice to Have Skills:
- Bare-metal Kubernetes experience
- Relevant certifications such as CKA/CKAD, RHCSA, Linux+, ITIL, or Server+
- Experience with GPU or HPC vendor technologies
- Experience supporting AI or large-scale compute environments
- Automation or scripting experience
Salary: $100,000 - $135,000