What are the responsibilities and job description for the Cloud Engineer position at Surge Technology Solutions Inc?

Type: W2 or 1099 (No C2C)

Visa: H1B, H4EAD, GCEAD, L2, OPT, CPT, Green Card, US Citizens (Only USA Applicants)

Workplace Type: Onsite - Chicago / Peoria, IL (Preferred); Dallas, TX; Nashville, TN

Experience: 8 years

Position’s Contributions to Work Group:

- This role is critical to the deployment, maintenance, and optimization of high-performance computing infrastructure, specifically leveraging NVIDIA’s advanced GPU technologies.

- You will work closely with AI researchers, data scientists, and software engineers to ensure our systems are robust, scalable, and tuned for cutting-edge machine learning workloads.

Typical task breakdown:

- Administer and maintain GPU-accelerated servers and clusters, including NVIDIA A100, H100, and other high-end GPU sets.

- Manage and optimize NVIDIA software stack components such as CUDA, cuDNN, TensorRT, NCCL, and NGC containers.

- Monitor system performance, troubleshoot hardware/software issues, and ensure high availability of AI infrastructure.

- Collaborate with DevOps and AI teams to support containerized workflows (Docker, Kubernetes) and distributed training environments.

- Implement security best practices and ensure compliance with internal and external standards.

- Lead upgrades, patching, and lifecycle management of GPU servers and related infrastructure.

- Provide documentation, automation scripts, and training for internal teams.

Work environment:

- Candidates must be able to work in the office 1 day a week initially and move to 5 days a week in the office when notified.

Education & Experience Required:

- Bachelor’s Degree with a minimum of 8 years of work experience; 5 years of experience in server administration, with at least 3 years focused on NVIDIA GPU-based systems.

Technical Skills:

- 5 years of experience in server administration, with at least 3 years focused on NVIDIA GPU-based systems.

- Deep understanding of Linux system administration, especially in HPC or AI environments.

- Hands-on experience with NVIDIA GPU drivers, CUDA toolkit, and performance tuning.

- Familiarity with Slurm, Kubernetes, or other job scheduling and orchestration tools.

- Experience with monitoring tools (e.g., Prometheus, Grafana) and infrastructure automation (e.g., Ansible, Terraform).

- Strong scripting skills (Bash, Python, etc.).

- Excellent problem-solving and communication skills.

(Desired)

- NVIDIA Certified Professional or similar credentials.

- Experience with multi-GPU and multi-node training setups.

- Familiarity with AI/ML frameworks (e.g., PyTorch, TensorFlow) and their GPU dependencies.

- Exposure to cloud-based GPU infrastructure (AWS, Azure, GCP).

Disqualifiers/Red Flags:

- Choppy tenure/consistent job hopping.

- If the candidate cannot go into the office 5 days a week right now, they most likely will not be able to later, so please do not submit them for this role unless they are able to work in the office 5 days a week now.

Please forward your resume and contact details to sahithi_s@surgetechinc.com or kaviya_t@surgetechinc.com, or call 832-990-6448.
