
NCCL and Libfabric: High-Performance Networking for Machine Learning

Mar 23, 2019
21:53

In this video from the 2019 OpenFabrics Workshop in Austin, Brian Barrett from Amazon presents: NCCL and Libfabric: High-Performance Networking for Machine Learning.

"NCCL is a GPU-oriented collective communication library developed by NVIDIA to accelerate deep learning frameworks such as Caffe, MXNet, and TensorFlow. NCCL is topology aware, taking advantage of on-node networks as well as multiple internode network interfaces in a single node. NCCL 2 was recently made available under a BSD license on GitHub and includes provisions for adding support for new network stacks. In the fall of 2018, AWS open sourced a Libfabric driver for NCCL (https://github.com/aws/aws-ofi-nccl). This talk examines the design choices for mapping NCCL communication semantics onto Libfabric, presents paths forward for supporting GPUDirect with Libfabric, and includes a discussion on how to grow the development community of the Libfabric driver for NCCL."

Learn more: https://www.openfabrics.org/2019-workshop-agenda-and-abstracts/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
