What are the responsibilities and job description for the Cloud Engineer position at Surge Technology Solutions Inc?
Type: W2 or 1099 (No C2C)
Visa: H1B, H4EAD, GCEAD, L2, OPT, CPT, Green Card, US Citizens (Only USA Applicants)
Workplace Type: Onsite - Chicago / Peoria, IL (Preferred); Dallas, TX; Nashville, TN
Experience: 8 years
Position’s Contributions to Work Group:
- This role is critical to the deployment, maintenance, and optimization of high-performance computing infrastructure, specifically leveraging NVIDIA’s advanced GPU technologies.
- You will work closely with AI researchers, data scientists, and software engineers to ensure our systems are robust, scalable, and tuned for cutting-edge machine learning workloads.
Typical task breakdown:
- Administer and maintain GPU-accelerated servers and clusters, including NVIDIA A100, H100, and other high-end GPU sets.
- Manage and optimize NVIDIA software stack components such as CUDA, cuDNN, TensorRT, NCCL, and NGC containers.
- Monitor system performance, troubleshoot hardware/software issues, and ensure high availability of AI infrastructure.
- Collaborate with DevOps and AI teams to support containerized workflows (Docker, Kubernetes) and distributed training environments.
- Implement security best practices and ensure compliance with internal and external standards.
- Lead upgrades, patching, and lifecycle management of GPU servers and related infrastructure.
- Provide documentation, automation scripts, and training for internal teams.
Work environment:
- Candidates must be able to go into the office 1 day a week and, when notified, eventually 5 days a week.
Education & Experience Required:
- Bachelor’s degree with a minimum of 8 years of work experience, including 5 years of experience in server administration, with at least 3 years focused on NVIDIA GPU-based systems.
Technical Skills:
- 5 years of experience in server administration, with at least 3 years focused on NVIDIA GPU-based systems.
- Deep understanding of Linux system administration, especially in HPC or AI environments.
- Hands-on experience with NVIDIA GPU drivers, CUDA toolkit, and performance tuning.
- Familiarity with Slurm, Kubernetes, or other job scheduling and orchestration tools.
- Experience with monitoring tools (e.g., Prometheus, Grafana) and infrastructure automation (e.g., Ansible, Terraform).
- Strong scripting skills (Bash, Python, etc.).
- Excellent problem-solving and communication skills.
(Desired)
- NVIDIA Certified Professional or similar credentials.
- Experience with multi-GPU and multi-node training setups.
- Familiarity with AI/ML frameworks (e.g., PyTorch, TensorFlow) and their GPU dependencies.
- Exposure to cloud-based GPU infrastructure (AWS, Azure, GCP).
Disqualifiers/Red Flags:
- Choppy tenure/consistent job hopping.
- If the candidate cannot go into the office 5 days a week right now, they most likely will not be able to later, so please do not submit them for this role unless they are able to work in the office 5 days a week now.
Please forward your resume and contact details to sahithi_s@surgetechinc.com or kaviya_t@surgetechinc.com, or call 832-990-6448.