NCCL requires at least CUDA 7.0 and Kepler or newer GPUs. Best performance is achieved when all GPUs are located on a common PCIe root complex, but multi-socket configurations are also supported.
Note: NCCL may also work with CUDA 6.5, but this is an untested configuration.
Build & run
To build the library and tests.$ cd nccl
$ make CUDA_HOME=<cuda install path> test
Test binaries are located in the subdirectories nccl/build/test/{single,mpi}.
$ ~/git/nccl$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./build/lib
$ ~/git/nccl$ ./build/test/single/all_reduce_test 100000000
# Using devices
# Rank 0 uses device 0 [0x04] GeForce GTX 1080 Ti
# Rank 1 uses device 1 [0x05] GeForce GTX 1080 Ti
# Rank 2 uses device 2 [0x08] GeForce GTX 1080 Ti
# Rank 3 uses device 3 [0x09] GeForce GTX 1080 Ti
# Rank 4 uses device 4 [0x83] GeForce GTX 1080 Ti
# Rank 5 uses device 5 [0x84] GeForce GTX 1080 Ti
# Rank 6 uses device 6 [0x87] GeForce GTX 1080 Ti
# Rank 7 uses device 7 [0x88] GeForce GTX 1080 Ti
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
100000000 100000000 char sum 30.244 3.31 5.79 0e+00 29.892 3.35 5.85 0e+00
100000000 100000000 char prod 30.493 3.28 5.74 0e+00 30.524 3.28 5.73 0e+00
100000000 100000000 char max 29.745 3.36 5.88 0e+00 29.877 3.35 5.86 0e+00
100000000 100000000 char min 29.744 3.36 5.88 0e+00 29.868 3.35 5.86 0e+00
100000000 25000000 int sum 29.692 3.37 5.89 0e+00 29.754 3.36 5.88 0e+00
100000000 25000000 int prod 30.733 3.25 5.69 0e+00 30.697 3.26 5.70 0e+00
100000000 25000000 int max 29.871 3.35 5.86 0e+00 29.700 3.37 5.89 0e+00
100000000 25000000 int min 29.809 3.35 5.87 0e+00 29.852 3.35 5.86 0e+00
100000000 50000000 half sum 28.590 3.50 6.12 1e-02 27.545 3.63 6.35 1e-02
100000000 50000000 half prod 27.416 3.65 6.38 1e-03 27.375 3.65 6.39 1e-03
100000000 50000000 half max 30.811 3.25 5.68 0e+00 30.670 3.26 5.71 0e+00
100000000 50000000 half min 30.818 3.24 5.68 0e+00 30.931 3.23 5.66 0e+00
100000000 25000000 float sum 29.719 3.36 5.89 1e-06 29.750 3.36 5.88 1e-06
100000000 25000000 float prod 29.741 3.36 5.88 1e-07 30.029 3.33 5.83 1e-07
100000000 25000000 float max 28.400 3.52 6.16 0e+00 28.400 3.52 6.16 0e+00
100000000 25000000 float min 28.364 3.53 6.17 0e+00 28.434 3.52 6.15 0e+00
100000000 12500000 double sum 33.989 2.94 5.15 0e+00 34.104 2.93 5.13 0e+00
100000000 12500000 double prod 33.895 2.95 5.16 2e-16 33.833 2.96 5.17 2e-16
100000000 12500000 double max 30.228 3.31 5.79 0e+00 30.273 3.30 5.78 0e+00
100000000 12500000 double min 30.324 3.30 5.77 0e+00 30.341 3.30 5.77 0e+00
100000000 12500000 int64 sum 29.914 3.34 5.85 0e+00 30.036 3.33 5.83 0e+00
100000000 12500000 int64 prod 30.975 3.23 5.65 0e+00 31.083 3.22 5.63 0e+00
100000000 12500000 int64 max 29.954 3.34 5.84 0e+00 29.949 3.34 5.84 0e+00
100000000 12500000 int64 min 29.946 3.34 5.84 0e+00 29.952 3.34 5.84 0e+00
100000000 12500000 uint64 sum 29.981 3.34 5.84 0e+00 30.100 3.32 5.81 0e+00
100000000 12500000 uint64 prod 30.911 3.24 5.66 0e+00 30.800 3.25 5.68 0e+00
100000000 12500000 uint64 max 29.890 3.35 5.85 0e+00 29.947 3.34 5.84 0e+00
100000000 12500000 uint64 min 29.929 3.34 5.85 0e+00 29.964 3.34 5.84 0e+00
Out of bounds values : 0 OK
Avg bus bandwidth : 5.81761