Optimizing Training Workloads on GPU Clusters

Name: Optimizing Training Workloads on GPU Clusters
Uploaded: Jun 18, 2025
Duration: 3643 s

Together AI (4.41K subscribers)

The talk covers best practices and technical guidance, and includes a live demonstration on a 2-node instant Kubernetes cluster. It walks through key considerations from initial setup through training execution and system monitoring.

Topics Covered:

Pre-Cluster Planning: Choosing between Kubernetes and Slurm, sizing GPU resources, and understanding model and data requirements.

Pre-Flight Validation: Verifying hardware (GPUs, CPUs, memory), the software stack (e.g., Docker), and network configuration for RDMA or Ethernet-based setups.

CPU and GPU Optimization: Understanding workload characteristics, NUMA node configuration, and avoiding common bottlenecks (e.g., CPU-heavy preprocessing).

Storage and Data Handling: Comparing parallel file systems with local NVMe, managing data ingestion and output, and minimizing transfer overhead.

Failure Recovery and Observability: Addressing issues such as GPU errors, node lockups, and network flaps, and implementing robust observability with tools like nvidia-smi and GPU utilization monitors.

Live Demo: Running a real training job with basic observability in place, and demonstrating progress checks and troubleshooting workflows.
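
As a rough illustration of the "basic observability" idea mentioned in the description, the sketch below polls nvidia-smi once and flags GPUs that look idle. It is not taken from the talk. The helper names (parse_gpu_csv, find_idle_gpus, query_gpus) and the 10% idle threshold are assumptions made for this example; the nvidia-smi flags used (--query-gpu and --format=csv,noheader,nounits) are standard options of that tool.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse rows of `nvidia-smi --format=csv,noheader,nounits` output.

    Rows that do not have exactly four integer fields (for example a
    field reported as "[N/A]") are skipped rather than guessed at.
    """
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != len(QUERY_FIELDS):
            continue
        try:
            index, util, used, total = (int(p) for p in parts)
        except ValueError:
            continue
        rows.append(
            {"index": index, "util_pct": util,
             "mem_used_mib": used, "mem_total_mib": total}
        )
    return rows


def find_idle_gpus(rows, threshold_pct=10):
    """Return indices of GPUs below the (assumed) utilization threshold."""
    return [r["index"] for r in rows if r["util_pct"] < threshold_pct]


def query_gpus():
    """Run nvidia-smi once; return [] if it is missing or exits with an error."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError:
        return []
    return parse_gpu_csv(result.stdout)
```

During a training run, calling find_idle_gpus(query_gpus()) on each node at a fixed interval gives a crude signal that a rank has stalled or that the job is input-bound. A production setup would more likely export these metrics to a monitoring system than poll by hand.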
