
ParallelUpdater cannot fully exploit multiple GPUs

See original GitHub issue

ParallelUpdater improves performance when using simple models and/or large mini-batches. For complex models, however, it currently fails to exploit multiple GPUs, and execution becomes almost sequential.
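For context, the data-parallel setup under discussion looks roughly like the sketch below. The toy model, random dataset, and four-device assignment are placeholders chosen to keep the snippet self-contained; they are not the reporter's actual training script.

import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training

# Tiny stand-in model, just to make the sketch runnable.
class MLP(chainer.Chain):
    def __init__(self):
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, 256)
            self.l2 = L.Linear(256, 10)

    def __call__(self, x):
        return self.l2(F.relu(self.l1(x)))

model = L.Classifier(MLP())
optimizer = chainer.optimizers.MomentumSGD()
optimizer.setup(model)

# Random stand-in dataset (1024 samples of 784 features, 10 classes).
train = chainer.datasets.TupleDataset(
    np.random.rand(1024, 784).astype(np.float32),
    np.random.randint(0, 10, size=1024).astype(np.int32))
train_iter = chainer.iterators.SerialIterator(train, batch_size=128)

# ParallelUpdater splits each mini-batch across the listed devices and reduces
# the gradients onto the 'main' device, all driven from a single Python
# process, which is where the kernel-issue bottleneck described here shows up.
updater = training.updaters.ParallelUpdater(
    train_iter, optimizer,
    devices={'main': 0, 'second': 1, 'third': 2, 'fourth': 3})

trainer = training.Trainer(updater, (10, 'epoch'), out='result')
trainer.run()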

I examined the profiler results and concluded that Python is simply too slow to issue kernels to multiple GPUs from a single process. A possible way to cope with this is to use multiple processes in ParallelUpdater. To enable that, cupy needs support for inter-process communication, which can be implemented with CUDA's inter-process memory handles (http://docs.nvidia.com/cuda/cuda-c-programming-guide/#interprocess-communication).
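As a rough illustration of that direction, the sketch below exports a GPU buffer from a parent process and maps it in a spawned child process. It assumes a cupy build whose cupy.cuda.runtime module wraps the cudaIpc* calls (ipcGetMemHandle / ipcOpenMemHandle / ipcCloseMemHandle); the shared "gradient buffer" protocol here is purely illustrative, not the actual ParallelUpdater design.

import multiprocessing as mp

import numpy as np
import cupy


def child(handle_bytes, shape, dtype):
    # Runs in a separate process: map the parent's GPU allocation and update it.
    cupy.cuda.Device(0).use()
    ptr = cupy.cuda.runtime.ipcOpenMemHandle(handle_bytes)
    size = int(np.prod(shape)) * np.dtype(dtype).itemsize
    mem = cupy.cuda.UnownedMemory(ptr, size, None)
    grad = cupy.ndarray(shape, dtype, cupy.cuda.MemoryPointer(mem, 0))
    grad += 1.0  # stand-in for "add this worker's gradients"
    cupy.cuda.Stream.null.synchronize()
    cupy.cuda.runtime.ipcCloseMemHandle(ptr)


if __name__ == '__main__':
    cupy.cuda.Device(0).use()
    # Note: a real implementation must account for cupy's memory pool, since the
    # IPC handle refers to the underlying cudaMalloc allocation rather than the
    # offset pointer the pool hands out; a fresh first allocation keeps this simple.
    grad = cupy.zeros((4, 4), dtype=cupy.float32)
    handle = cupy.cuda.runtime.ipcGetMemHandle(grad.data.ptr)

    # 'spawn' avoids forking a process that already holds a CUDA context.
    ctx = mp.get_context('spawn')
    p = ctx.Process(target=child, args=(handle, grad.shape, np.float32))
    p.start()
    p.join()

    print(float(grad.sum()))  # reflects the child's in-place update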

The following is a profiler result with 4 GPUs and ResNet-152. It clearly shows that the GPUs are used sequentially rather than in parallel.

[Profiler timeline screenshot, captured 2016-08-10 18:39:22]

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

3 reactions
jekbradbury commented, Sep 13, 2016

I have a patch which uses CUDA’s interprocess communication and multiprocessing to give near-linear speedup for data parallelism across multiple GPUs, but it was written for Chainer 1.5 and I’d have to do some work to bring it up to compatibility with the latest version. Is there interest in this? I’m not sure when I’ll get time to work on it, but it’ll probably be in the next few weeks.

0 reactions
jekbradbury commented, Nov 8, 2016

I spent some time looking through the patch I wrote earlier; it was fairly specific to the task I was using it for (machine translation) and would need a lot of work to make it general enough. It sounds like what anaruse has is more helpful, but I’m happy to help in any way I can. Here’s my old code in case it’s useful to anyone (the only thing this depends on is a wrapper for CUDA’s ipcMemHandle API).
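The wrapper referred to above is not included in this page; a hypothetical minimal version built on ctypes might look like the following. The library name, error handling, and helper names are assumptions for illustration, not jekbradbury's original code.

import ctypes

# Adjust to the installed CUDA runtime, e.g. 'libcudart.so.11.0' on Linux.
_cudart = ctypes.CDLL('libcudart.so')

CUDA_IPC_HANDLE_SIZE = 64  # sizeof(cudaIpcMemHandle_t)


class CudaIpcMemHandle(ctypes.Structure):
    _fields_ = [('reserved', ctypes.c_byte * CUDA_IPC_HANDLE_SIZE)]


def _check(status):
    # cudaError_t: 0 means cudaSuccess.
    if status != 0:
        raise RuntimeError('CUDA runtime call failed with error %d' % status)


def get_mem_handle(dev_ptr):
    # Export a device allocation (given as an integer pointer) as 64 opaque bytes.
    handle = CudaIpcMemHandle()
    _check(_cudart.cudaIpcGetMemHandle(ctypes.byref(handle),
                                       ctypes.c_void_p(dev_ptr)))
    return ctypes.string_at(ctypes.addressof(handle), CUDA_IPC_HANDLE_SIZE)


def open_mem_handle(handle_bytes, flags=1):
    # Map a handle exported by another process; flags=1 is
    # cudaIpcMemLazyEnablePeerAccess. Returns the device pointer as an int.
    handle = CudaIpcMemHandle()
    ctypes.memmove(ctypes.addressof(handle), handle_bytes, CUDA_IPC_HANDLE_SIZE)
    dev_ptr = ctypes.c_void_p()
    _check(_cudart.cudaIpcOpenMemHandle(ctypes.byref(dev_ptr), handle, flags))
    return dev_ptr.value


def close_mem_handle(dev_ptr):
    _check(_cudart.cudaIpcCloseMemHandle(ctypes.c_void_p(dev_ptr)))

Since the 64-byte handle is plain bytes, it can be sent to worker processes through a multiprocessing queue or pipe and opened on the other side.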
