
Optimizing Training Workloads on GPU Clusters

576 views · Jun 18, 2025 · 1:00:43

The talk covers best practices, technical guidance, and a live demonstration on a 2-node instant Kubernetes cluster, walking through key considerations from initial setup through training execution and system monitoring.

Topics covered:

- Pre-Cluster Planning: choosing between Kubernetes and Slurm, sizing GPU resources, and understanding model and data requirements
- Pre-Flight Validation: verifying hardware (GPUs, CPUs, memory), the software stack (e.g., Docker), and network configuration for RDMA or Ethernet-based setups
- CPU and GPU Optimization: understanding workload characteristics, NUMA node configuration, and avoiding common bottlenecks (e.g., CPU-heavy preprocessing)
- Storage and Data Handling: comparing parallel file systems vs. local NVMe, managing data ingestion/output, and minimizing transfer overhead
- Failure Recovery and Observability: addressing issues such as GPU errors, node lockups, and network flaps, and implementing robust observability with tools like nvidia-smi and GPU utilization monitors
- Live Demo: running a real training job with basic observability in place, and demonstrating progress checks and troubleshooting workflows
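As a minimal sketch of the kind of GPU utilization monitoring the topics above mention, the snippet below parses the CSV output of nvidia-smi's query mode and flags under-utilized GPUs, which is a common symptom of the CPU-heavy preprocessing bottleneck the talk calls out. The `sample_gpus` helper and the utilization threshold are illustrative assumptions, not something prescribed by the talk.

```python
import csv
import io
import subprocess


def parse_gpu_utilization(csv_text):
    """Parse the output of:
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits
    into a list of per-GPU dicts."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        index, util, mem = (field.strip() for field in row)
        gpus.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(mem),
        })
    return gpus


def sample_gpus():
    """Query the local driver; requires nvidia-smi on PATH (illustrative helper)."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return parse_gpu_utilization(out.stdout)


# Canned output in the shape nvidia-smi prints for this query, so the
# parsing logic can be exercised without a GPU present:
sample = "0, 98, 71213\n1, 12, 3045\n"
stats = parse_gpu_utilization(sample)
# A GPU sitting well below peak utilization during training often points
# at an input-pipeline or CPU bottleneck; 50% here is an arbitrary cutoff.
idle = [g["index"] for g in stats if g["util_pct"] < 50]
print(idle)
```

In practice a check like this would run periodically alongside the job; a persistently low utilization figure is the cue to look at data loading, NUMA placement, or preprocessing before blaming the GPUs themselves.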

Download: 360p mp4, 94.4 MB

