
Towards a Standardized Representation for Deep Learning Collective Algorithms

277 views
Oct 24, 2024
28:45

Tushar Krishna (Associate Professor), Georgia Institute of Technology

The surge of artificial intelligence, particularly large language models, has driven the rapid development of large-scale machine learning clusters. Executing distributed models on these clusters is often constrained by communication overhead, making efficient utilization of available network resources crucial. As a result, the routing algorithm employed for collective communications (i.e., the collective algorithm) plays a pivotal role in determining overall performance. Unfortunately, existing collective communication libraries for distributed machine learning (e.g., NCCL, RCCL) are limited to a fixed set of basic collective algorithms. This limitation hinders communication optimization, especially in modern clusters with heterogeneous and asymmetric topologies. Furthermore, manually designing collective algorithms for all possible combinations of network topologies and collective patterns requires heavy engineering and validation effort.

