TransformerXL: StopIteration: Caught StopIteration in replica 0 on device 0
See original GitHub issue

Environment info
- transformers version: 3.4.0
- Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-debian-stretch-sid
- Python version: 3.6.9
- PyTorch version (GPU?): 1.6.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Error I get
Traceback (most recent call last):
  File "/ai/fzc/minGPT/transformerXLtest.py", line 163, in <module>
    input_ids=inputs["input_ids"].to(device),
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/modeling_transfo_xl.py", line 866, in forward
    mems = self.init_mems(bsz)
  File "/opt/conda/lib/python3.6/site-packages/transformers/modeling_transfo_xl.py", line 800, in init_mems
    param = next(self.parameters())
StopIteration
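What the traceback points at: TransfoXLModel.init_mems calls next(self.parameters()) to find the model's device. Under DataParallel on recent PyTorch (roughly 1.5 and later), each replica exposes its tensors as plain attributes rather than registered parameters, so self.parameters() yields nothing inside the worker and next() raises StopIteration. A minimal sketch of that failure mode with a toy module (the Probe class is made up for illustration; reproducing it needs at least two visible GPUs, otherwise DataParallel skips replication):

import torch
import torch.nn as nn

class Probe(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(4))

    def forward(self, x):
        # Same pattern as TransfoXL's init_mems: grab any parameter to read
        # its device. In a DataParallel replica on PyTorch >= 1.5 this
        # iterator is empty, so next() raises StopIteration in the worker.
        param = next(self.parameters())
        return x.to(param.device) + self.weight

model = nn.DataParallel(Probe().to("cuda:0"))
model(torch.randn(2, 4).to("cuda:0"))  # -> StopIteration: Caught StopIteration in replica 0 on device 0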
To reproduce the problem
Run the code below:
import torch
from torch.nn import DataParallel
from transformers import TransfoXLTokenizer, TransfoXLModel
device = "cuda:0"
# Get model
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103', return_dict=True)
model = DataParallel(model, device_ids=list(range(torch.cuda.device_count())))
model.to(device=device)
# Run forward
inputs = tokenizer(["This is an example"], return_tensors="pt")
outputs = model(
    input_ids=inputs["input_ids"].to(device),
)
print(f"outputs: {outputs}")
print("Success.")
Issue Analytics
- Created: 3 years ago
- Comments: 14 (7 by maintainers)
Top Results From Across the Web
- Pytorch - Caught StopIteration in replica 1 on device 1 error ...: While running on the server, with 4 GPUs enabled, below is the error I get: StopIteration: Caught StopIteration in replica 1 on device...
- Caught StopIteration in replica 0 on device 0 - PyTorch Forums: I don't think the parameters() method changed in the last year(s) so wouldn't think the error is caused by an update in the...
- Caught StopIteration in replica 0 on device 0: troubleshooting and resolution: After searching online, the problem may be caused by parts of the data having different precision during training (16-bit and 32-bit tensors coexisting at the same time); the suggestion is to modify the code here and pin the dtype directly to torch.
- Source code for transformers.modeling_utils - Hugging Face: for module in self.modules(): module.mem_rss_diff = 0 ... try: return next(self.parameters()).device except StopIteration: # For nn.
- [transformer-xl] model notes - AliceYing - 博客园: Error: StopIteration: Caught StopIteration in replica 0 on device 0. Fix: downgrade the PyTorch version; 1.3.1 was tested to train normally, and 1.4.0 should also work.
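The modeling_utils snippet above hints at the defensive pattern used elsewhere in transformers for this situation: try next(self.parameters()) first, and fall back when the iterator is empty instead of letting StopIteration escape. A rough sketch of that idea; module_device and its fallback are made-up illustrations, not the library's actual code:

import torch
import torch.nn as nn

def module_device(module: nn.Module) -> torch.device:
    # Preferred path: ask the first registered parameter for its device.
    try:
        return next(module.parameters()).device
    except StopIteration:
        # Fallback for parameter-less modules (e.g. DataParallel replicas):
        # look for any tensor stored as a plain attribute and use its device.
        for value in vars(module).values():
            if torch.is_tensor(value):
                return value.device
        return torch.device("cpu")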
As I said in my previous post, you can just use single-GPU or distributed training instead. Of course transformer-xl is supported; but we cannot update its code to bypass the PyTorch issues with DataParallel without breaking backwards compatibility with previous checkpoints.

As of now, PyTorch doesn't support calling self.parameters() within DataParallel, which causes the current issue. Even after fixing that, which was straightforward, PyTorch also doesn't support calling self.ParameterList and self.ParameterDict, which are also used in TransfoXL and will cause another issue. As PyTorch is moving people away from DataParallel, they are unlikely to fix this anytime soon on their end. On our end, this is going to be much harder to fix without breaking backward compatibility, as changing the way the model is organized means previous checkpoints could no longer be loaded. In the meantime, you could use DistributedDataParallel instead.
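A minimal sketch of that DistributedDataParallel route, assuming one process per GPU started by a launcher that sets the LOCAL_RANK environment variable (e.g. python -m torch.distributed.launch --use_env on this PyTorch 1.6 setup, or torchrun on newer releases); the backend choice and the script itself are illustrative, not from the original thread:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import TransfoXLTokenizer, TransfoXLModel

local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103', return_dict=True).cuda(local_rank)
# Each process owns a full copy of the module, so next(self.parameters())
# inside TransfoXL sees real parameters; DDP only synchronizes gradients
# across processes rather than replicating the module per forward pass.
model = DDP(model, device_ids=[local_rank])

inputs = tokenizer(["This is an example"], return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"].cuda(local_rank))
print(f"rank {local_rank}: {outputs.last_hidden_state.shape}")

For pure inference the DDP wrapper is not strictly needed; each process can simply run the unwrapped model on its own GPU.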