What are the responsibilities and job description for the Exciting Opportunity for HPC Cluster & Scheduler Management consultant in Fremont, CA/ Tualatin, OR position at Noblesoft Technologies?
Role: HPC Consultant
Location: Fremont, CA / Tualatin, OR
HPC Cluster & Scheduler Management
- Design, configure, tune, and optimize SLURM partitions, queues, QoS, and scheduling policies to maximize cluster utilization and workload efficiency.
- Perform in-depth analysis of job scheduling behavior, bottlenecks, and resource contention.
- Troubleshoot job failures, performance degradation, and scheduler-related issues in production HPC environments.
- Implement fair-share, backfill, reservations, and policy-driven scheduling as required.
- Lead HPC storage performance benchmarking using industry-standard tools (e.g., IOR, FIO, MDTest, IOzone).
- Analyze I/O patterns of HPC workloads and map them to appropriate storage architectures (parallel file systems, NVMe, Lustre, Spectrum Scale, etc.).
- Provide technical input for storage selection and procurement, including performance expectations, sizing, and cost-performance tradeoffs.
- Collaborate with vendors and internal teams during POCs and performance validation exercises.
- Build, install, configure, and maintain HPC applications, compilers, libraries, and scientific software stacks.
- Optimize application performance using MPI, OpenMP, GPU acceleration (where applicable), and tuned math libraries.
- Support multiple compiler toolchains (GCC, Intel, LLVM, NVIDIA HPC SDK, etc.).
- Implement and manage environment modules (Lmod) or similar software management frameworks.
- Conduct system-level performance tuning across compute, memory, network, and storage layers.
- Diagnose node-level issues involving CPU, GPU, interconnects (InfiniBand/Ethernet), and OS configurations.
- Create operational runbooks, performance baselines, and troubleshooting documentation.
- Support cluster upgrades, expansions, and hardware refresh activities.
- Work closely with application owners, researchers, and infrastructure teams to meet aggressive delivery timelines.
- Translate workload requirements into practical HPC configurations and optimizations.
- Provide clear technical guidance and recommendations to leadership and stakeholders.
Core HPC Skills
- 8-12 years of hands-on HPC engineering experience in production environments.
- Strong expertise with SLURM (configuration, tuning, troubleshooting).
- Solid understanding of Linux systems (RHEL/CentOS/Rocky/Alma preferred).
- Deep knowledge of HPC storage systems and I/O performance analysis.
- Proven experience building and optimizing HPC applications and libraries.
- MPI implementations (Open MPI, MPICH), OpenMP
- Compilers and toolchains (GCC, Intel, NVIDIA HPC SDK)
- Performance tools (perf, VTune, nvprof/nsys, IB diagnostics)
- Environment modules (Lmod), package managers (Spack preferred)
- Bash/Python scripting for automation and diagnostics
- Experience with GPU-based HPC workloads (NVIDIA CUDA, ROCm).
- Exposure to cloud-based HPC (Azure, AWS, GCP).
- Familiarity with parallel file systems such as Lustre or IBM Spectrum Scale.
- Vendor engagement experience for HPC hardware/storage evaluations.