Multi-GPU scattering2d [torch]
Hi everyone,
In order for the scattering2d torch implementation to fully leverage the fact that it inherits from nn.Module, and thus be parallelizable with nn.DataParallel, I believe the following lines of the kymatio-v2 branch should be modified:
so that they respectively become simply:
self.register_single_filter(phi, n)
self.register_single_filter(v, n)
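For readers following along, register_single_filter presumably just wraps nn.Module.register_buffer under a per-filter name. A minimal, purely illustrative sketch of that idea (the class name and the 'tensor<n>' buffer-naming scheme are my assumptions, not necessarily kymatio's actual code):

```python
import torch.nn as nn


class TorchFrontendSketch(nn.Module):
    """Illustration only: register each filter tensor as a named buffer."""

    def register_single_filter(self, v, n):
        # A registered buffer is broadcast by nn.DataParallel's replicate()
        # to every GPU, whereas a tensor that only lives inside a plain
        # dict stays tied to whichever device it was created on.
        self.register_buffer('tensor' + str(n), v)  # buffer name is an assumption
```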
and:
phis = copy.deepcopy(self.phi)
psis = copy.deepcopy(self.psi)
Indeed, since self.phi and self.psi are dicts, the replica models built by DataParallel's replicate.py on each GPU all share the same underlying self.phi and self.psi dicts at every forward pass. If we assign the named buffers directly into those dicts (in the first two lines mentioned above), then, because the buffers themselves are replicated separately on each GPU, self.phi[c] and self.psi[j][k] end up pointing to tensors on a single GPU while the inputs are scattered across all GPUs, which leads to a TypeError: Input and filter must be on the same GPU.
The problem is similar for the other two lines, and one workaround is thus for each replica model to have its own copy of the phi and psi dicts. Another workaround would be to also pass a buffer dict in the scattering call:
and load the filters within the scattering core function (but that would be less generic).
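To make the first workaround concrete, here is a self-contained toy sketch of the pattern (it mimics, but is not, the kymatio code; the class, the buffer names and the dummy computation are assumptions): filters are stored both in a plain dict and as registered buffers, and each replica rebuilds its dict from its own buffers at forward time.

```python
import copy

import torch
import torch.nn as nn


class ToyFilterBank(nn.Module):
    """Toy stand-in for the scattering module (not kymatio code)."""

    def __init__(self, filters):
        super().__init__()
        self.phi = {}
        for n, f in enumerate(filters):
            self.phi[n] = f                             # plain dict: shared by all replicas
            self.register_buffer('tensor' + str(n), f)  # buffer: replicated per GPU

    def forward(self, x):
        # Each replica works on its own deep copy of the dict...
        phi = copy.deepcopy(self.phi)
        buffers = dict(self.named_buffers())
        # ...and overwrites the entries with its own buffers, which
        # DataParallel has already moved to this replica's device.
        for n in phi:
            phi[n] = buffers['tensor' + str(n)]
        # Dummy computation standing in for the scattering transform.
        return x * phi[0]


if __name__ == '__main__':
    net = ToyFilterBank([torch.randn(8, 8)])
    x = torch.randn(4, 1, 8, 8)
    if torch.cuda.is_available():
        net = nn.DataParallel(net).cuda()
        x = x.cuda()
    print(net(x).shape)  # torch.Size([4, 1, 8, 8])
```

This way the shared self.phi dict is never mutated, and every replica multiplies its inputs by filters that live on the same device.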
The proposed solution seems to work on multiple GPUs, for instance by slightly modifying the following lines of cifar.py in examples/2d:
by:
if use_cuda:
    scattering = torch.nn.DataParallel(scattering).cuda()

model = Scattering2dCNN(K, args.classifier)
if use_cuda:
    model = torch.nn.DataParallel(model).cuda()

# DataLoaders
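For reference, a hedged end-to-end sketch of the intended usage once the fix above is in place, assuming the kymatio-v2 Scattering2D torch frontend (shapes here are arbitrary):

```python
import torch
from kymatio.torch import Scattering2D

# Scattering2D inherits from nn.Module, so it can be wrapped like any
# other module; DataParallel then splits the batch across visible GPUs.
scattering = Scattering2D(J=2, shape=(32, 32))

x = torch.randn(128, 3, 32, 32)
if torch.cuda.is_available():
    scattering = torch.nn.DataParallel(scattering).cuda()
    x = x.cuda()

Sx = scattering(x)
print(Sx.shape)  # expected: torch.Size([128, 3, 81, 8, 8]) for J=2, L=8
```

Since DataParallel only splits the batch dimension, each replica sees a slice of the batch but needs the full filter bank on its own device, which is exactly what the buffer registration discussed above provides.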
This seems to work most of the time with 2 GPUs, but less reliably with 4 GPUs, where I sometimes get a Segmentation fault (core dumped). Enabling faulthandler with faulthandler.enable() gives the following traceback:
Thread 0x00007f4b5c889700 (most recent call first):
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/scattering2d/backend/torch_backend.py", line 231 in fft
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/scattering2d/core/scattering2d.py", line 23 in scattering2d
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/scattering2d/frontend/torch_frontend.py", line 126 in scattering
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/frontend/torch_frontend.py", line 20 in forward
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532 in __call__
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60 in _worker
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap
Current thread 0x00007f4b5d08a700 (most recent call first):
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/scattering2d/backend/torch_backend.py", line 231 in fft
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/scattering2d/core/scattering2d.py", line 23 in scattering2d
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/scattering2d/frontend/torch_frontend.py", line 126 in scattering
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/frontend/torch_frontend.py", line 20 in forward
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532 in __call__
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60 in _worker
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap
Thread 0x00007f4b5f7fe700 (most recent call first):
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 296 in wait
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap
Thread 0x00007f4b5ffff700 (most recent call first):
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 296 in wait
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap
Thread 0x00007f4b80ff9700 (most recent call first):
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 296 in wait
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap
Thread 0x00007f4b817fa700 (most recent call first):
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 296 in wait
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap
Thread 0x00007f4b81ffb700 (most recent call first):
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/selectors.py", line 415 in select
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 920 in wait
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 414 in _poll
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 257 in poll
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/queues.py", line 104 in get
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 25 in _pin_memory_loop
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap
Thread 0x00007f4ceeacb700 (most recent call first):
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 1060 in _wait_for_tstate_lock
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 1044 in join
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 77 in parallel_apply
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162 in parallel_apply
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152 in forward
File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532 in __call__
File "cifar.py", line 75 in train
File "cifar.py", line 177 in main
File "cifar.py", line 184 in <module>
Tested on Ubuntu 16.04 and 18.04 with torch 1.4.0 and torchvision 0.5.0 (I got similar behavior with torch 1.3.1 and torchvision 0.4.2).
Top GitHub Comments
@MuawizChaudhary @eickenberg it’s fixed on my machine. Can you confirm?
Please only close an issue when it’s fixed… and otherwise refer to the fix in the issue…