Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

TransformerXL: StopIteration: Caught StopIteration in replica 0 on device 0

See original GitHub issue

Environment info

  • transformers version: 3.4.0
  • Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-debian-stretch-sid
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.6.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

@TevenLeScao

Error I get

Traceback (most recent call last):
  File "/ai/fzc/minGPT/transformerXLtest.py", line 163, in <module>
    input_ids=inputs["input_ids"].to(device),
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/modeling_transfo_xl.py", line 866, in forward
    mems = self.init_mems(bsz)
  File "/opt/conda/lib/python3.6/site-packages/transformers/modeling_transfo_xl.py", line 800, in init_mems
    param = next(self.parameters())
StopIteration
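
Some context on the failing line (this explanation is mine, not from the original issue): init_mems calls param = next(self.parameters()) only to look up the model's device and dtype. On PyTorch 1.5 and later, the per-forward replicas created by nn.DataParallel no longer expose their parameters through parameters(), so inside a replica that iterator is empty and next() raises StopIteration, which DataParallel re-raises as "Caught StopIteration in replica 0 on device 0". With a single visible GPU, DataParallel skips the replication path entirely, so the script below only fails on a multi-GPU machine. A minimal sketch of the mechanism, independent of TransfoXL:

import torch
from torch import nn

class Probe(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # Same pattern as TransfoXLModel.init_mems: grab any parameter to find
        # the module's device/dtype. Inside a DataParallel replica on
        # PyTorch >= 1.5 the parameter iterator is empty, so this raises
        # StopIteration.
        param = next(self.parameters())
        return x + param.sum()

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(Probe().cuda())
    # Reproduces "StopIteration: Caught StopIteration in replica 0 on device 0"
    model(torch.ones(2, 4).cuda())
else:
    print("Needs at least two GPUs to hit the replica code path.")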

To reproduce the problem

Run the code below:

import torch
from torch.nn import DataParallel
from transformers import TransfoXLTokenizer, TransfoXLModel

device = "cuda:0"

# Get model
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103', return_dict=True)
model = DataParallel(model, device_ids=list(range(torch.cuda.device_count())))
model.to(device=device)

# Run forward
inputs = tokenizer(["This is an example"], return_tensors="pt")

outputs = model(
    input_ids=inputs["input_ids"].to(device),
)

print(f"outputs: {outputs}")
print("Success.")

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

1 reaction
TevenLeScao commented, Nov 10, 2020

As I said in my previous post, you can just use single-GPU or distributed training instead. Of course transformer-xl is supported; but we cannot update its code to bypass the PyTorch issues with DataParallel without breaking backwards compatibility with previous checkpoints.
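
For reference, a minimal single-GPU variant of the reproduction script above that simply drops the DataParallel wrapper, so init_mems can see the parameters again (this sketch is mine, not from the thread; the last_hidden_state access assumes return_dict=True, as in the original script):

import torch
from transformers import TransfoXLTokenizer, TransfoXLModel

device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103', return_dict=True)
model.to(device)  # no DataParallel wrapper, so parameters() behaves normally inside forward

inputs = tokenizer(["This is an example"], return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"].to(device))
print(f"outputs.last_hidden_state: {outputs.last_hidden_state.shape}")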

1 reaction
TevenLeScao commented, Nov 3, 2020

As of now, PyTorch doesn't support calling self.parameters() within DataParallel, which causes the current issue. Even after fixing that, which was straightforward, PyTorch also doesn't support ParameterList and ParameterDict under DataParallel, both of which TransfoXL uses, so another error would follow. As PyTorch is moving people away from DataParallel, they are unlikely to fix this anytime soon on their end. On our end, this is much harder to fix in a backwards-compatible way, since changing the way the model is organized means previous checkpoints can no longer be loaded. In the meantime, you could use DistributedDataParallel instead.
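
A rough sketch of the DistributedDataParallel route suggested above, with one process per GPU. The nccl backend and the LOCAL_RANK handling are my assumptions, not from the thread; they match a launcher that exports LOCAL_RANK, e.g. python -m torch.distributed.launch --use_env --nproc_per_node=2 script.py:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import TransfoXLTokenizer, TransfoXLModel

# One process per GPU; the launcher is assumed to set LOCAL_RANK and the
# rendezvous variables that init_process_group needs.
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103', return_dict=True).cuda(local_rank)
# Unlike DataParallel, each DDP process keeps a full copy of the module, so
# next(self.parameters()) inside init_mems works as usual.
model = DDP(model, device_ids=[local_rank])

inputs = tokenizer(["This is an example"], return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"].cuda(local_rank))
print(f"rank {local_rank}: {outputs.last_hidden_state.shape}")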

Read more comments on GitHub >

Top Results From Across the Web

Pytorch - Caught StopIteration in replica 1 on device 1 error ...
While running on the server, with 4 GPUs enabled, below is the error I get: StopIteration: Caught StopIteration in replica 1 on device...
Read more >
Caught StopIteration in replica 0 on device 0 - PyTorch Forums
I don't think the parameters() method changed in the last year(s) so wouldn't think the error is caused by an update in the...
Read more >
Caught StopIteration in replica 0 on device 0: troubleshooting and resolution
After some searching online, the problem may be caused by mixed numerical precision in part of the data during training (16-bit and 32-bit values present at the same time); try modifying it here and specifying it directly as torch.
Read more >
Source code for transformers.modeling_utils - Hugging Face
for module in self.modules(): module.mem_rss_diff = 0 ... try: return next(self.parameters()).device except StopIteration: # For nn.
Read more >
[transformer-xl] Model notes - AliceYing - cnblogs
Error: StopIteration: Caught StopIteration in replica 0 on device 0. Solution: downgrade the PyTorch version; 1.3.1 was tested and trains normally, and 1.4.0 should work as well.
Read more >
