TypeError exception in AxialPositionalEncoding when using DataParallel

See original GitHub issue

Hello,

I want to run SinkhornTransformerLM on multiple GPUs, so I'm wrapping the model in torch.nn.DataParallel. However, when I do this, I get an exception:

Traceback (most recent call last):
  File "script.py", line 27, in <module>
    model(x)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/sinkhorn_transformer.py", line 792, in forward
    x = self.axial_pos_emb(x) + x
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/sinkhorn_transformer.py", line 243, in forward
    return pos_emb[:, :t]
TypeError: 'int' object is not subscriptable

Looking at the code, it would seem that self.weights does not get populated. To reproduce this error, I took the first example in README.md and changed

model(x) # (1, 2048, 20000)

to

model = torch.nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count()))).to('cuda')
model(x)
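A guess at the mechanism, sketched without torch so it runs anywhere: DataParallel replicas are known not to receive the contents of an nn.ParameterList (they appear empty inside each replica). If the positional embedding is assembled by summing self.weights (hypothetical, but consistent with the traceback), Python's sum over an empty iterable returns the int 0, which then fails at pos_emb[:, :t]:

```python
# What a DataParallel replica plausibly sees: nn.ParameterList contents
# are not copied into replicas, so the list of axial weights is empty.
weights = []

# Summing an empty iterable yields the int 0, not a tensor ...
pos_emb = sum(w for w in weights)
print(type(pos_emb))   # <class 'int'>

# ... and slicing an int is exactly the error in the traceback above.
try:
    pos_emb[:, :2048]
except TypeError as e:
    print(e)           # 'int' object is not subscriptable
```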

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
kl0211 commented, May 5, 2020

@kl0211 oh ok, I put in a temporary fix, should work now!

very cool! I’d like to know how that turns out!

@lucidrains, Looks like your fix got it to work! Thanks a bunch!
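For readers hitting the same pitfall in their own modules, one common workaround is to register each axial embedding as a direct nn.Parameter attribute rather than inside an nn.ParameterList, since directly registered parameters are copied into DataParallel replicas. A hypothetical sketch (not the actual patch that landed in the repository; names and shapes are illustrative):

```python
import torch
import torch.nn as nn

class AxialPosEmbFixed(nn.Module):
    """Hypothetical axial positional embedding that survives DataParallel."""

    def __init__(self, dim=512, axial_shape=(64, 32)):
        super().__init__()
        self.seq_len = axial_shape[0] * axial_shape[1]
        # Direct attributes, not an nn.ParameterList: these ARE replicated.
        self.weight_rows = nn.Parameter(torch.randn(1, axial_shape[0], 1, dim))
        self.weight_cols = nn.Parameter(torch.randn(1, 1, axial_shape[1], dim))

    def forward(self, x):
        t = x.shape[1]
        # Broadcast-sum the two axial factors, then flatten to a full
        # (1, seq_len, dim) positional embedding and truncate to length t.
        pos_emb = (self.weight_rows + self.weight_cols)
        pos_emb = pos_emb.reshape(1, self.seq_len, -1)
        return pos_emb[:, :t]
```

With the parameters registered this way, each replica sees real tensors and the `pos_emb[:, :t]` slice succeeds instead of hitting an int.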

@kl0211 you should try Deepspeed. DataParallel actually doesn’t give you a very big speed up

Cool! I’ll see if I can try it out. Thanks for the tip!

0 reactions
lucidrains commented, May 5, 2020

@kl0211 do share your results! this repository is still in the exploratory phase!
