
Multi-GPU scattering2d [torch]


Hi everyone,

Close #640

In order for the scattering2d torch implementation to fully leverage the fact that it inherits from nn.Module, and thus be parallelizable with nn.DataParallel, I believe the following lines of the kymatio-v2 branch should be modified:

https://github.com/kymatio/kymatio/blob/5bf71fd7aab1e60bb12cfaa4a2b42d722a91e893/kymatio/scattering2d/frontend/torch_frontend.py#L35

https://github.com/kymatio/kymatio/blob/5bf71fd7aab1e60bb12cfaa4a2b42d722a91e893/kymatio/scattering2d/frontend/torch_frontend.py#L43

https://github.com/kymatio/kymatio/blob/5bf71fd7aab1e60bb12cfaa4a2b42d722a91e893/kymatio/scattering2d/frontend/torch_frontend.py#L57

https://github.com/kymatio/kymatio/blob/5bf71fd7aab1e60bb12cfaa4a2b42d722a91e893/kymatio/scattering2d/frontend/torch_frontend.py#L65

to, respectively:

    self.register_single_filter(phi, n)
    self.register_single_filter(v, n)

and:

    phis = copy.deepcopy(self.phi)
    psis = copy.deepcopy(self.psi)

Indeed, since self.phi and self.psi are plain Python dicts, the replica models built on each GPU by DataParallel's replicate.py at every forward pass all share the same underlying self.phi and self.psi dicts. If the named buffers are assigned directly into those dicts (the first two lines mentioned above), then, because the buffers themselves are replicated separately on each GPU, self.phi[c] and self.psi[j][k] end up pointing to tensors on a single GPU while the inputs are scattered across all GPUs, which leads to a TypeError: Input and filter must be on the same GPU.
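To make the failure mode concrete, here is a minimal, self-contained sketch (a hypothetical FilterBank module, not kymatio code) showing how a tensor kept in a plain dict escapes both .cuda() and DataParallel's per-GPU replication, while a registered buffer is handled correctly:

    import torch
    import torch.nn as nn

    class FilterBank(nn.Module):
        """Hypothetical module illustrating the dict-vs-buffer issue."""
        def __init__(self):
            super().__init__()
            filt = torch.randn(8, 8)
            # Registered buffer: moved by .cuda()/.to() and replicated per GPU.
            self.register_buffer('filt_buf', filt.clone())
            # Plain dict entry: invisible to .cuda()/.to() and to replicate().
            self.filters = {'phi': filt.clone()}

        def forward(self, x):
            # Both tensors must live on x.device for every replica to work.
            return x * self.filters['phi'] + self.filt_buf

    if torch.cuda.device_count() >= 2:
        m = nn.DataParallel(FilterBank()).cuda()
        x = torch.randn(4, 8, 8, device='cuda')
        m(x)  # raises a device-mismatch error: filters['phi'] never left the CPU

In the kymatio case the dict entries do reference GPU tensors, but only the copies living on a single device, so the mismatch surfaces on the other replicas.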

The problem is similar for the other two lines, so one workaround is for each replica model to have its own copy of the phi and psi dicts. Another workaround would be to also pass a buffer dict into the scattering call:

https://github.com/kymatio/kymatio/blob/5bf71fd7aab1e60bb12cfaa4a2b42d722a91e893/kymatio/scattering2d/frontend/torch_frontend.py#L125-L126

and load the filters inside the scattering core function (but this would be less generic).
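For reference, the first workaround could look roughly like the sketch below (the helper name and the buffer naming scheme are hypothetical; the real code paths are the kymatio lines linked above): each replica deep-copies the filter dicts and refills them from its own named buffers, which DataParallel has already placed on that replica's GPU.

    import copy

    def load_filters(self):
        """Hypothetical sketch: give each replica its own filter dicts,
        pointing at that replica's own (per-GPU) registered buffers."""
        phis = copy.deepcopy(self.phi)
        psis = copy.deepcopy(self.psi)
        buffers = dict(self.named_buffers())
        for key in list(phis):
            name = 'phi_' + str(key)  # hypothetical buffer naming scheme
            if name in buffers:
                phis[key] = buffers[name]
        # The psis[j][k] entries would be refreshed from the buffers analogously.
        return phis, psis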

The proposed solution seems to work on multiple GPUs, for instance after slightly modifying the following lines of cifar.py in examples/2d:

https://github.com/kymatio/kymatio/blob/5bf71fd7aab1e60bb12cfaa4a2b42d722a91e893/examples/2d/cifar.py#L133-L141

to:

    if use_cuda:
        # Replicate the scattering transform (and its filters) on every GPU.
        scattering = torch.nn.DataParallel(scattering).cuda()

    model = Scattering2dCNN(K, args.classifier)

    if use_cuda:
        # Wrap the classifier the same way.
        model = torch.nn.DataParallel(model).cuda()

    # DataLoaders
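Outside of the cifar.py script, the same wrapping can be exercised in isolation. A minimal sketch, assuming the kymatio.torch frontend and at least one CUDA device:

    import torch
    from kymatio.torch import Scattering2D

    scattering = Scattering2D(J=2, shape=(32, 32))
    if torch.cuda.is_available():
        # Each GPU gets its own replica of the module, and thus of the filters.
        scattering = torch.nn.DataParallel(scattering).cuda()

    x = torch.randn(128, 3, 32, 32)
    if torch.cuda.is_available():
        x = x.cuda()
    # The batch is split across the available GPUs and the scattering
    # coefficients are gathered back on the default device.
    Sx = scattering(x)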

With this modified cifar.py, training seems to work most of the time with 2 GPUs, but more erratically with 4 GPUs, where I sometimes get a Segmentation fault (core dumped). Running with faulthandler enabled (faulthandler.enable()) gives the following traceback:

Thread 0x00007f4b5c889700 (most recent call first):
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/scattering2d/backend/torch_backend.py", line 231 in fft
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/scattering2d/core/scattering2d.py", line 23 in scattering2d
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/scattering2d/frontend/torch_frontend.py", line 126 in scattering
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/frontend/torch_frontend.py", line 20 in forward
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532 in __call__
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60 in _worker
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap

Current thread 0x00007f4b5d08a700 (most recent call first):
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/scattering2d/backend/torch_backend.py", line 231 in fft
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/scattering2d/core/scattering2d.py", line 23 in scattering2d
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/scattering2d/frontend/torch_frontend.py", line 126 in scattering
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/kymatio/frontend/torch_frontend.py", line 20 in forward
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532 in __call__
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60 in _worker
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f4b5f7fe700 (most recent call first):
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 296 in wait
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f4b5ffff700 (most recent call first):
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 296 in wait
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f4b80ff9700 (most recent call first):
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 296 in wait
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f4b817fa700 (most recent call first):
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 296 in wait
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f4b81ffb700 (most recent call first):
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/selectors.py", line 415 in select
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 920 in wait
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 414 in _poll
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 257 in poll
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/multiprocessing/queues.py", line 104 in get
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 25 in _pin_memory_loop
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 870 in run
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f4ceeacb700 (most recent call first):
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 1060 in _wait_for_tstate_lock
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/threading.py", line 1044 in join
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 77 in parallel_apply
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162 in parallel_apply
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152 in forward
  File "/users/data/zarka/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532 in __call__
  File "cifar.py", line 75 in train
  File "cifar.py", line 177 in main
  File "cifar.py", line 184 in <module>

Tested on Ubuntu 16.04 and 18.04 with torch 1.4.0 and torchvision 0.5.0 (similar behavior with torch 1.3.1 and torchvision 0.4.2).

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 14

Top GitHub Comments

1 reaction
edouardoyallon commented, Jun 12, 2020

@MuawizChaudhary @eickenberg it’s fixed on my machine. Can you confirm?

0 reactions
edouardoyallon commented, Jan 12, 2021

Please close only an issue when it’s fixed… and otherwise refer to the fix in the issue…
