Performance with NVLink

See original GitHub issue

Hi everyone, we started testing the RTX 2080 Ti with a Quadro NVLink bridge. We have a machine with four 2080 Ti GPUs. We connected cards 2 and 3 with the NVLink bridge, while cards 0 and 1 were left without a bridge. The output of 'nvidia-smi nvlink --status' shows that everything is fine:

nvidia-smi nvlink --status -i 2
GPU 2: GeForce RTX 2080 Ti (UUID: GPU-6c41afdf-d8aa-ed9b-24a2-fa90c471fda0)
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s
nvidia-smi nvlink --status -i 3
GPU 3: GeForce RTX 2080 Ti (UUID: GPU-a78f4e04-05cc-74fe-5d3a-c2ad0f069e05)
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s

I also checked P2P connectivity; its output, attached as p2p_test.txt, also seems to be fine.
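
As a quick cross-check from Python (not part of the original test; the device indices simply mirror the setup above), PyTorch can report whether peer access between each pair of GPUs is possible:

import torch

# Check whether each pair of visible GPUs can access the other's memory directly
# (over NVLink or PCIe P2P). Cards 2 and 3 are the NVLink-bridged pair above.
for src in range(torch.cuda.device_count()):
    for dst in range(torch.cuda.device_count()):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: peer access {'possible' if ok else 'not possible'}")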

Then I made slight modifications to the PyTorch ImageNet example and Apex ImageNet example, so that they use random input instead of ImageNet data.
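
The exact modification isn't included in the issue; one way to get the same effect (a sketch using torchvision's FakeData dataset rather than the author's actual change, with placeholder size and class count) is to swap the ImageFolder datasets in main.py for synthetic images:

import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Synthetic ImageNet-shaped data, so the benchmark measures GPU compute and
# communication rather than disk I/O.
train_dataset = datasets.FakeData(
    size=100000,
    image_size=(3, 224, 224),
    num_classes=1000,
    transform=transforms.ToTensor(),
)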

I decided to run 4 tests:

  • Test 1, PyTorch DataParallel with ResNet18: python main.py dummy
  • Test 2, PyTorch DistributedDataParallel with ResNet18: python main.py --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 --dist-url 'tcp://127.0.0.1:2504' dummy
  • Test 3, Apex FP32 training: python main_amp.py -a resnet50 --b 64 --workers 4 --opt-level O0 dummy
  • Test 4, Apex FP16 training: python main_amp.py -a resnet50 --b 64 --workers 4 --opt-level O2 dummy

For the batch-processing time measurements I took the running average of time per batch reported at the end of the first training epoch in each test. Tests with NVLink were run on cards 2 and 3, tests without NVLink on cards 0 and 1. For Test 1 the batch processing time decreased with NVLink from 0.190 to 0.182 seconds, and for Test 2 from 0.172 to 0.166 seconds. However, for Test 3 the NVLink connection increased the batch processing time from 0.249 to 0.278 seconds, and for Test 4 from 0.172 to 0.202 seconds. Has anybody observed a similar performance drop when using NVLink and Apex together, or could you suggest more appropriate performance tests?
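
One way to separate the interconnect from the model is a pure allreduce microbenchmark. The sketch below is a suggestion rather than anything from the issue, and the buffer size, iteration count, and port are placeholders; running it once with CUDA_VISIBLE_DEVICES=2,3 and once with CUDA_VISIBLE_DEVICES=0,1 compares NVLink against PCIe directly:

import os
import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU, joined into an NCCL process group.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # ~100 MB FP32 buffer, roughly the gradient volume of a ResNet-50.
    buf = torch.randn(25 * 1024 * 1024, device="cuda")

    for _ in range(5):          # warm-up
        dist.all_reduce(buf)
    torch.cuda.synchronize()

    iters = 50
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    if rank == 0:
        print(f"mean allreduce time: {(time.time() - start) / iters * 1e3:.2f} ms")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)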

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 8 (2 by maintainers)

Top GitHub Comments

1 reaction
SergeyMilyaev commented, Jun 29, 2021

@FlorinAndrei there is no special code for running training on multiple GPUs with NVLink. You can use my DataParallel example or the Apex example. Just make sure that, for the cards connected with NVLink, the output of the command 'nvidia-smi nvlink' confirms that NVLink is enabled, as you can see in my original post.
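
For completeness, a minimal DataParallel sketch along those lines (the model, batch size, and device IDs are placeholders, not the author's exact script):

import torch
import torch.nn as nn
import torchvision.models as models

# Restrict DataParallel to the two NVLink-bridged cards; the scatter/gather
# copies between them go over NVLink when peer access is enabled.
model = models.resnet18().cuda(2)
model = nn.DataParallel(model, device_ids=[2, 3], output_device=2)

inputs = torch.randn(64, 3, 224, 224).cuda(2)
outputs = model(inputs)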

1 reaction
mcarilli commented, May 3, 2019

Good! The batch size numbers I quoted were for a V100 with 16GB of onboard memory, so for your 2080Tis with (I believe) 11GB of onboard memory, the batch sizes you are using for FP32 and mixed precision are likely achieving decent saturation.

Typically I work on systems like

  • DGXs where everything is connected via nvlink,
  • DGX2s where everything is connected via nvswitch, or
  • minimal desktop setups with no nvlink.

I don’t have access to a setup like yours where it’s possible to get an apples-to-apples comparison of nvlink versus no nvlink. I think the speedup it’s delivering in your case is reasonable, given that for a 2-GPU job the comms are probably a smallish fraction of each iteration’s runtime. For Test 4 (training with O2) the data that’s being allreduced is (mostly) FP16, so the total amount of time, and probably the fraction of per-iteration time, spent in comms is reduced; it’s sensible to me that nvlink has less of a positive impact for Test 4 than it does for Test 3 (FP32).
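
For reference, the only difference between Test 3 and Test 4 above is the opt_level handed to Apex; a minimal sketch of that call (model, optimizer, and batch size are placeholders) looks like:

import torch
import torchvision.models as models
from apex import amp

model = models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# O0 keeps everything in FP32 (Test 3); O2 casts the model to FP16 and keeps
# FP32 master weights (Test 4), so the gradients that get allreduced are mostly FP16.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

loss = model(torch.randn(64, 3, 224, 224).cuda()).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()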

It’s difficult to predict exactly what speedup nvlink will deliver, because it depends on the ratio of comms to compute time within each device (which depends on the network, the individual bandwidth/computational horsepower of each device, the per-device batch size, and the total number of processes/GPUs in the job), but it should always be better than PCIE. In general, the more GPUs your job uses (within a node), the more helpful nvlink becomes.
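
If you want to estimate that ratio on your own hardware, here is a rough sketch (a suggestion, not from the thread): it times one backward pass and one explicit allreduce over the flattened gradients with CUDA events, which approximates the communication DistributedDataParallel normally overlaps with backward.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torchvision.models as models

def timed_ms(fn):
    # Time a GPU operation with CUDA events, in milliseconds.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = models.resnet50().cuda()
    loss = model(torch.randn(64, 3, 224, 224, device="cuda")).sum()
    compute_ms = timed_ms(lambda: loss.backward())

    # Flatten all gradients and time one allreduce over them.
    flat = torch.cat([p.grad.flatten() for p in model.parameters()])
    comm_ms = timed_ms(lambda: dist.all_reduce(flat))

    if rank == 0:
        print(f"backward {compute_ms:.1f} ms, allreduce {comm_ms:.1f} ms, "
              f"comms fraction ~ {comm_ms / (compute_ms + comm_ms):.2f}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)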

Read more comments on GitHub >

Top Results From Across the Web

NVLink & NVSwitch: Fastest HPC Data Center Platform
NVLink Performance. NVLink in NVIDIA H100 increases inter-GPU communication bandwidth 1.5X compared to the previous generation, so researchers can use ...

NVLink Performance in DaVinci Resolve 17.0
NVLink is a very interesting technology from NVIDIA that allows GPUs to be able to directly communicate with each other at speed up...

Dual NVIDIA GeForce RTX 3090 NVLink Performance ...
In our dual NVIDIA GeForce RTX 3090 NVLink compute performance review, we see how scaling to multiple GPUs impacts performance.

NVLink vs. SLI and Multiple GPUs - Is it worth it?
NVLink is still not a magical switch that you can turn on to gain more performance. NVLink is great in some use cases,...

How Nvidia's NVLink Boosts GPU Performance
NVLink is a new feature for Nvidia GPUs that aims to drastically improve performance by increasing the total bandwidth between the GPU and ...
