ParallelUpdater cannot fully exploit multiple GPUs
ParallelUpdater improves performance when using simple models and/or large mini-batches. For complex models, however, ParallelUpdater currently fails to exploit multiple GPUs and becomes almost sequential.
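For reference, this is the kind of setup in question — a minimal sketch of driving four GPUs with ParallelUpdater. The toy data and model are placeholders, not from the issue (the report used ResNet-152), and it assumes a recent Chainer where the class lives under `chainer.training.updaters`:

```python
import numpy as np
import chainer
from chainer import training

# Toy dataset and model purely for illustration.
x = np.random.rand(1000, 32).astype(np.float32)
t = np.random.randint(10, size=1000).astype(np.int32)
train_iter = chainer.iterators.SerialIterator(
    chainer.datasets.TupleDataset(x, t), batch_size=128)

model = chainer.links.Classifier(chainer.links.Linear(32, 10))
optimizer = chainer.optimizers.MomentumSGD()
optimizer.setup(model)

# One model replica per GPU; ParallelUpdater scatters each mini-batch,
# drives all replicas from a single Python thread, and accumulates
# gradients on the 'main' device.
updater = training.updaters.ParallelUpdater(
    train_iter, optimizer,
    devices={'main': 0, 'second': 1, 'third': 2, 'fourth': 3})
trainer = training.Trainer(updater, (5, 'epoch'))
trainer.run()
```

Because a single thread issues the kernels for every replica, launch overhead grows with model complexity, which is the behavior described above.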
I examined the profiler results and concluded that Python is simply too slow to issue kernels to multiple GPUs. A possible way to cope with this issue is to use multiple processes in ParallelUpdater. To enable that, CuPy needs support for inter-process communication, which can be implemented with CUDA's inter-process memory handles (http://docs.nvidia.com/cuda/cuda-c-programming-guide/#interprocess-communication).
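To make the idea concrete, here is a minimal sketch of the CUDA inter-process memory handle mechanism. It is written against the `ipcGetMemHandle`/`ipcOpenMemHandle` wrappers that later CuPy versions expose under `cupy.cuda.runtime`; these wrappers did not exist when this issue was filed, and the sketch glosses over memory-pool offsets (it assumes the array sits at the start of its allocation):

```python
import multiprocessing as mp

import cupy
from cupy.cuda import runtime


def consumer(handle, shape, dtype, nbytes):
    # Map the producer's device allocation into this process.
    ptr = runtime.ipcOpenMemHandle(handle)
    mem = cupy.cuda.UnownedMemory(ptr, nbytes, None)
    arr = cupy.ndarray(shape, dtype=dtype,
                       memptr=cupy.cuda.MemoryPointer(mem, 0))
    arr += 1  # this kernel runs on the shared buffer
    runtime.deviceSynchronize()
    runtime.ipcCloseMemHandle(ptr)


if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # fork is unsafe after CUDA initialization
    x = cupy.zeros(4, dtype=cupy.float32)
    handle = runtime.ipcGetMemHandle(x.data.ptr)
    p = ctx.Process(target=consumer,
                    args=(handle, x.shape, x.dtype, x.nbytes))
    p.start()
    p.join()
    print(x)  # expected: [1. 1. 1. 1.]
```

The handle is an opaque byte string that can be pickled and sent to another process, which is exactly what a multi-process ParallelUpdater would need to share parameter and gradient buffers without host round-trips.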
The following is a profiler result with 4 GPUs and ResNet-152. It clearly shows that the GPUs are used sequentially.

[profiler timeline screenshot from the original issue]
Top GitHub Comments
I have a patch which uses CUDA's interprocess communication and multiprocessing to give near-linear speedup for data parallelism across multiple GPUs, but it was written for Chainer 1.5 and I'd have to do some work to bring it up to compatibility with the latest version. Is there interest in this? I'm not sure when I'll get time to work on it, but it'll probably be in the next few weeks.

I spent some time looking through the patch I wrote earlier; it was fairly specific to the task I was using it for (machine translation) and would need a lot of work to make it general enough. It sounds like what anaruse has is more helpful, but I'm happy to help in any way I can. Here's my old code in case it's useful to anyone (the only thing this depends on is a wrapper for CUDA's ipcMemHandle API).
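The linked code itself is not reproduced here. As a rough, hypothetical illustration of the pattern the comment describes — one process per GPU, so kernel launches are no longer serialized behind a single Python interpreter — here is a minimal sketch; for simplicity it moves gradients through host memory with a Queue, where the actual patch would exchange device buffers via CUDA IPC handles:

```python
import multiprocessing as mp

import numpy as np


def worker(device_id, batch, out_queue):
    # Each process owns one GPU and its own CUDA context, so kernels for
    # different devices are issued concurrently across processes.
    import cupy
    with cupy.cuda.Device(device_id):
        x = cupy.asarray(batch)
        grad = 2 * x  # stand-in for a real forward/backward pass
        out_queue.put(cupy.asnumpy(grad))


if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # fork is unsafe after CUDA initialization
    n_gpus = 4
    data = np.arange(8 * n_gpus, dtype=np.float32).reshape(n_gpus, 8)
    queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(i, data[i], queue))
             for i in range(n_gpus)]
    for p in procs:
        p.start()
    # Average the per-GPU gradients, as ParallelUpdater does on 'main'.
    mean_grad = sum(queue.get() for _ in procs) / n_gpus
    for p in procs:
        p.join()
    print(mean_grad)
```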