Performance with NVLink

See original GitHub issue

Hi everyone, we started testing the RTX 2080 Ti with a Quadro NVLink bridge. We have a machine with four 2080 Ti GPUs. We connected cards 2 and 3 with the NVLink bridge, while cards 0 and 1 were left without a bridge. The output of 'nvidia-smi nvlink --status' shows that everything is fine:

nvidia-smi nvlink --status -i 2
GPU 2: GeForce RTX 2080 Ti (UUID: GPU-6c41afdf-d8aa-ed9b-24a2-fa90c471fda0)
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s
nvidia-smi nvlink --status -i 3
GPU 3: GeForce RTX 2080 Ti (UUID: GPU-a78f4e04-05cc-74fe-5d3a-c2ad0f069e05)
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s

I also checked P2P connectivity; its output, attached as p2p_test.txt, also seems to be fine.
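
As a quick cross-check from Python (not part of the original test; the device indices simply mirror the setup above), PyTorch can report whether peer access between each pair of GPUs is possible:

import torch

# Check whether each pair of visible GPUs can access the other's memory directly
# (over NVLink or PCIe P2P). Cards 2 and 3 are the NVLink-bridged pair above.
for src in range(torch.cuda.device_count()):
    for dst in range(torch.cuda.device_count()):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: peer access {'possible' if ok else 'not possible'}")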

Then I made slight modifications to the PyTorch ImageNet example and Apex ImageNet example, so that they use random input instead of ImageNet data.
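
The exact modification isn't included in the issue; one way to get the same effect (a sketch using torchvision's FakeData dataset rather than the author's actual change, with placeholder size and class count) is to swap the ImageFolder datasets in main.py for synthetic images:

import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Synthetic ImageNet-shaped data, so the benchmark measures GPU compute and
# communication rather than disk I/O.
train_dataset = datasets.FakeData(
    size=100000,
    image_size=(3, 224, 224),
    num_classes=1000,
    transform=transforms.ToTensor(),
)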

I decided to run 4 tests:

  • Test 1, PyTorch DataParallel with ResNet18: python main.py dummy
  • Test 2, PyTorch DistributedDataParallel with ResNet18: python main.py --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 --dist-url 'tcp://127.0.0.1:2504' dummy
  • Test 3, Apex FP32 training: python main_amp.py -a resnet50 --b 64 --workers 4 --opt-level O0 dummy
  • Test 4, Apex FP16 training: python main_amp.py -a resnet50 --b 64 --workers 4 --opt-level O2 dummy

For the batch-processing time measurements I took the running average of time per batch reported at the end of the first training epoch in each test. Tests with NVLink were run on cards 2 and 3, tests without NVLink on cards 0 and 1. For Test 1 the batch processing time decreased with NVLink from 0.190 to 0.182 seconds, and for Test 2 from 0.172 to 0.166 seconds. However, for Test 3 the NVLink connection increased the batch processing time from 0.249 to 0.278 seconds, and for Test 4 from 0.172 to 0.202 seconds. Has anybody observed a similar performance drop when using NVLink and Apex together, or could you suggest more appropriate performance tests?
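
One way to separate the interconnect from the model is a pure allreduce microbenchmark. The sketch below is a suggestion rather than anything from the issue, and the buffer size, iteration count, and port are placeholders; running it once with CUDA_VISIBLE_DEVICES=2,3 and once with CUDA_VISIBLE_DEVICES=0,1 compares NVLink against PCIe directly:

import os
import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU, joined into an NCCL process group.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # ~100 MB FP32 buffer, roughly the gradient volume of a ResNet-50.
    buf = torch.randn(25 * 1024 * 1024, device="cuda")

    for _ in range(5):          # warm-up
        dist.all_reduce(buf)
    torch.cuda.synchronize()

    iters = 50
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    if rank == 0:
        print(f"mean allreduce time: {(time.time() - start) / iters * 1e3:.2f} ms")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)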

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 8 (2 by maintainers)

Top GitHub Comments

1 reaction
SergeyMilyaev commented, Jun 29, 2021

@FlorinAndrei there is no special code for running training on multiple GPUs with NVLink. You can use my DataParallel example or the Apex example. Just make sure that, for the cards connected with NVLink, the output of the command 'nvidia-smi nvlink' confirms that NVLink is enabled, as you can see in my original post.
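
For completeness, a minimal DataParallel sketch along those lines (the model, batch size, and device IDs are placeholders, not the author's exact script):

import torch
import torch.nn as nn
import torchvision.models as models

# Restrict DataParallel to the two NVLink-bridged cards; the scatter/gather
# copies between them go over NVLink when peer access is enabled.
model = models.resnet18().cuda(2)
model = nn.DataParallel(model, device_ids=[2, 3], output_device=2)

inputs = torch.randn(64, 3, 224, 224).cuda(2)
outputs = model(inputs)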

1 reaction
mcarilli commented, May 3, 2019

Good! The batch size numbers I quoted were for a V100 with 16GB of onboard memory, so for your 2080Tis with (I believe) 11GB of onboard memory, the batch sizes you are using for FP32 and mixed precision are likely achieving decent saturation.

Typically I work on systems like

  • DGXs where everything is connected via nvlink,
  • DGX2s where everything is connected via nvswitch, or
  • minimal desktop setups with no nvlink.

I don’t have access to a setup like yours where it’s possible to get an apples-to-apples comparison of nvlink versus no nvlink. I think the speedup it’s delivering in your case is reasonable, given that for a 2-GPU job the comms are probably a smallish fraction of each iteration’s runtime. For Test 4 (training with O2) the data that’s being allreduced is (mostly) FP16, so the total amount of time, and probably the fraction of per-iteration time, spent in comms is reduced; it’s sensible to me that nvlink has less of a positive impact for Test 4 than it does for Test 3 (FP32).
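
For reference, the only difference between Test 3 and Test 4 above is the opt_level handed to Apex; a minimal sketch of that call (model, optimizer, and batch size are placeholders) looks like:

import torch
import torchvision.models as models
from apex import amp

model = models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# O0 keeps everything in FP32 (Test 3); O2 casts the model to FP16 and keeps
# FP32 master weights (Test 4), so the gradients that get allreduced are mostly FP16.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

loss = model(torch.randn(64, 3, 224, 224).cuda()).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()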

It’s difficult to predict exactly what speedup nvlink will deliver, because it depends on the ratio of comms to compute time within each device (which depends on the network, the individual bandwidth/computational horsepower of each device, the per-device batch size, and the total number of processes/GPUs in the job), but it should always be better than PCIE. In general, the more GPUs your job uses (within a node), the more helpful nvlink becomes.
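
If you want to estimate that ratio on your own hardware, here is a rough sketch (a suggestion, not from the thread): it times one backward pass and one explicit allreduce over the flattened gradients with CUDA events, which approximates the communication DistributedDataParallel normally overlaps with backward.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torchvision.models as models

def timed_ms(fn):
    # Time a GPU operation with CUDA events, in milliseconds.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = models.resnet50().cuda()
    loss = model(torch.randn(64, 3, 224, 224, device="cuda")).sum()
    compute_ms = timed_ms(lambda: loss.backward())

    # Flatten all gradients and time one allreduce over them.
    flat = torch.cat([p.grad.flatten() for p in model.parameters()])
    comm_ms = timed_ms(lambda: dist.all_reduce(flat))

    if rank == 0:
        print(f"backward {compute_ms:.1f} ms, allreduce {comm_ms:.1f} ms, "
              f"comms fraction ~ {comm_ms / (compute_ms + comm_ms):.2f}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)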

Read more comments on GitHub >

Top Results From Across the Web

NVLink & NVSwitch: Fastest HPC Data Center Platform
NVLink Performance. NVLink in NVIDIA H100 increases inter-GPU communication bandwidth 1.5X compared to the previous generation, so researchers can use ...

NVLink Performance in DaVinci Resolve 17.0
NVLink is a very interesting technology from NVIDIA that allows GPUs to be able to directly communicate with each other at speed up...

Dual NVIDIA GeForce RTX 3090 NVLink Performance ...
In our dual NVIDIA GeForce RTX 3090 NVLink compute performance review, we see how scaling to multiple GPUs impacts performance.

NVLink vs. SLI and Multiple GPUs - Is it worth it?
NVLink is still not a magical switch that you can turn on to gain more performance. NVLink is great in some use cases,...

How Nvidia's NVLink Boosts GPU Performance
NVLink is a new feature for Nvidia GPUs that aims to drastically improve performance by increasing the total bandwidth between the GPU and ...
