In this video from the 2019 OpenFabrics Workshop in Austin, Brian Barrett from Amazon presents: NCCL and Libfabric: High-Performance Networking for Machine Learning.
"NCCL is a GPU-oriented collective communication library developed by NVIDIA to accelerate deep learning frameworks
such as Caffe, MXNet, and TensorFlow. NCCL is topology aware, taking advantage of on-node networks as well as
multiple internode network interfaces in a single node. NCCL 2 was recently made available under a BSD license on
GitHub and includes provisions for adding support for new network stacks. In the fall of 2018, AWS open sourced a
Libfabric driver for NCCL (https://github.com/aws/aws-ofi-nccl). This talk examines the design choices for mapping NCCL
communication semantics onto Libfabric, presents paths forward for supporting GPUDirect with Libfabric, and includes a
discussion on how to grow the development community of the Libfabric driver for NCCL."
Learn more: https://www.openfabrics.org/2019-workshop-agenda-and-abstracts/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter