ParallelUpdater cannot fully exploit multiple GPUs
ParallelUpdater improves performance when using simple models and/or large mini-batches. For complex models, however, ParallelUpdater currently fails to exploit multiple GPUs and becomes almost sequential.
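For reference, this is the kind of setup in question — a minimal sketch of driving four GPUs with ParallelUpdater. The toy data and model are placeholders, not from the issue (the report used ResNet-152), and it assumes a recent Chainer where the class lives under `chainer.training.updaters`:

```python
import numpy as np
import chainer
from chainer import training

# Toy dataset and model purely for illustration.
x = np.random.rand(1000, 32).astype(np.float32)
t = np.random.randint(10, size=1000).astype(np.int32)
train_iter = chainer.iterators.SerialIterator(
    chainer.datasets.TupleDataset(x, t), batch_size=128)

model = chainer.links.Classifier(chainer.links.Linear(32, 10))
optimizer = chainer.optimizers.MomentumSGD()
optimizer.setup(model)

# One model replica per GPU; ParallelUpdater scatters each mini-batch,
# drives all replicas from a single Python thread, and accumulates
# gradients on the 'main' device.
updater = training.updaters.ParallelUpdater(
    train_iter, optimizer,
    devices={'main': 0, 'second': 1, 'third': 2, 'fourth': 3})
trainer = training.Trainer(updater, (5, 'epoch'))
trainer.run()
```

Because a single thread issues the kernels for every replica, launch overhead grows with model complexity, which is the behavior described above.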
I examined the profiler results and concluded that Python is simply too slow to issue kernels to multiple GPUs. A possible way to cope with this issue is to use multiple processes in ParallelUpdater. To enable that, CuPy needs support for inter-process communication, which can be implemented with CUDA's inter-process memory handles (http://docs.nvidia.com/cuda/cuda-c-programming-guide/#interprocess-communication).
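To make the idea concrete, here is a minimal sketch of the CUDA inter-process memory handle mechanism. It is written against the `ipcGetMemHandle`/`ipcOpenMemHandle` wrappers that later CuPy versions expose under `cupy.cuda.runtime`; these wrappers did not exist when this issue was filed, and the sketch glosses over memory-pool offsets (it assumes the array sits at the start of its allocation):

```python
import multiprocessing as mp

import cupy
from cupy.cuda import runtime


def consumer(handle, shape, dtype, nbytes):
    # Map the producer's device allocation into this process.
    ptr = runtime.ipcOpenMemHandle(handle)
    mem = cupy.cuda.UnownedMemory(ptr, nbytes, None)
    arr = cupy.ndarray(shape, dtype=dtype,
                       memptr=cupy.cuda.MemoryPointer(mem, 0))
    arr += 1  # this kernel runs on the shared buffer
    runtime.deviceSynchronize()
    runtime.ipcCloseMemHandle(ptr)


if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # fork is unsafe after CUDA initialization
    x = cupy.zeros(4, dtype=cupy.float32)
    handle = runtime.ipcGetMemHandle(x.data.ptr)
    p = ctx.Process(target=consumer,
                    args=(handle, x.shape, x.dtype, x.nbytes))
    p.start()
    p.join()
    print(x)  # expected: [1. 1. 1. 1.]
```

The handle is an opaque byte string that can be pickled and sent to another process, which is exactly what a multi-process ParallelUpdater would need to share parameter and gradient buffers without host round-trips.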
The following is a profiler result with 4 GPUs and ResNet-152. It clearly shows that the GPUs are used sequentially.

[profiler timeline screenshot from the original issue]
Top GitHub Comments
I have a patch which uses CUDA's interprocess communication and multiprocessing to give near-linear speedup for data parallelism across multiple GPUs, but it was written for Chainer 1.5 and I'd have to do some work to bring it up to compatibility with the latest version. Is there interest in this? I'm not sure when I'll get time to work on it, but it'll probably be in the next few weeks.

I spent some time looking through the patch I wrote earlier; it was fairly specific to the task I was using it for (machine translation) and would need a lot of work to make it general enough. It sounds like what anaruse has is more helpful, but I'm happy to help in any way I can. Here's my old code in case it's useful to anyone (the only thing this depends on is a wrapper for CUDA's ipcMemHandle API).
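The linked code itself is not reproduced here. As a rough, hypothetical illustration of the pattern the comment describes — one process per GPU, so kernel launches are no longer serialized behind a single Python interpreter — here is a minimal sketch; for simplicity it moves gradients through host memory with a Queue, where the actual patch would exchange device buffers via CUDA IPC handles:

```python
import multiprocessing as mp

import numpy as np


def worker(device_id, batch, out_queue):
    # Each process owns one GPU and its own CUDA context, so kernels for
    # different devices are issued concurrently across processes.
    import cupy
    with cupy.cuda.Device(device_id):
        x = cupy.asarray(batch)
        grad = 2 * x  # stand-in for a real forward/backward pass
        out_queue.put(cupy.asnumpy(grad))


if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # fork is unsafe after CUDA initialization
    n_gpus = 4
    data = np.arange(8 * n_gpus, dtype=np.float32).reshape(n_gpus, 8)
    queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(i, data[i], queue))
             for i in range(n_gpus)]
    for p in procs:
        p.start()
    # Average the per-GPU gradients, as ParallelUpdater does on 'main'.
    mean_grad = sum(queue.get() for _ in procs) / n_gpus
    for p in procs:
        p.join()
    print(mean_grad)
```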