TypeError exception in AxialPositionalEncoding when using DataParallel

See original GitHub issue

Hello,

I want to run SinkhornTransformerLM on multiple GPUs, so I'm wrapping the model in torch.nn.DataParallel. However, when I do this, I get an exception:

Traceback (most recent call last):
  File "script.py", line 27, in <module>
    model(x)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/sinkhorn_transformer.py", line 792, in forward
    x = self.axial_pos_emb(x) + x
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/sinkhorn_transformer.py", line 243, in forward
    return pos_emb[:, :t]
TypeError: 'int' object is not subscriptable

Looking at the code, it would seem that self.weights does not get populated. To reproduce this error, I took the first example in README.md and changed

model(x) # (1, 2048, 20000)

to

model = torch.nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count()))).to('cuda')
model(x)
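A guess at the mechanism, sketched without torch so it runs anywhere: DataParallel replicas are known not to receive the contents of an nn.ParameterList (they appear empty inside each replica). If the positional embedding is assembled by summing self.weights (hypothetical, but consistent with the traceback), Python's sum over an empty iterable returns the int 0, which then fails at pos_emb[:, :t]:

```python
# What a DataParallel replica plausibly sees: nn.ParameterList contents
# are not copied into replicas, so the list of axial weights is empty.
weights = []

# Summing an empty iterable yields the int 0, not a tensor ...
pos_emb = sum(w for w in weights)
print(type(pos_emb))   # <class 'int'>

# ... and slicing an int is exactly the error in the traceback above.
try:
    pos_emb[:, :2048]
except TypeError as e:
    print(e)           # 'int' object is not subscriptable
```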

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
kl0211 commented, May 5, 2020

@kl0211 oh ok, I put in a temporary fix, should work now!

very cool! I’d like to know how that turns out!

@lucidrains, Looks like your fix got it to work! Thanks a bunch!
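For readers hitting the same pitfall in their own modules, one common workaround is to register each axial embedding as a direct nn.Parameter attribute rather than inside an nn.ParameterList, since directly registered parameters are copied into DataParallel replicas. A hypothetical sketch (not the actual patch that landed in the repository; names and shapes are illustrative):

```python
import torch
import torch.nn as nn

class AxialPosEmbFixed(nn.Module):
    """Hypothetical axial positional embedding that survives DataParallel."""

    def __init__(self, dim=512, axial_shape=(64, 32)):
        super().__init__()
        self.seq_len = axial_shape[0] * axial_shape[1]
        # Direct attributes, not an nn.ParameterList: these ARE replicated.
        self.weight_rows = nn.Parameter(torch.randn(1, axial_shape[0], 1, dim))
        self.weight_cols = nn.Parameter(torch.randn(1, 1, axial_shape[1], dim))

    def forward(self, x):
        t = x.shape[1]
        # Broadcast-sum the two axial factors, then flatten to a full
        # (1, seq_len, dim) positional embedding and truncate to length t.
        pos_emb = (self.weight_rows + self.weight_cols)
        pos_emb = pos_emb.reshape(1, self.seq_len, -1)
        return pos_emb[:, :t]
```

With the parameters registered this way, each replica sees real tensors and the `pos_emb[:, :t]` slice succeeds instead of hitting an int.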

@kl0211 you should try Deepspeed. DataParallel actually doesn’t give you a very big speed up

Cool! I’ll see if I can try it out. Thanks for the tip!

0 reactions
lucidrains commented, May 5, 2020

@kl0211 do share your results! this repository is still in the exploratory phase!
