TransformerXL: StopIteration: Caught StopIteration in replica 0 on device 0
See original GitHub issue

Environment info
- transformers version: 3.4.0
- Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-debian-stretch-sid
- Python version: 3.6.9
- PyTorch version (GPU?): 1.6.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Error I get
Traceback (most recent call last):
  File "/ai/fzc/minGPT/transformerXLtest.py", line 163, in <module>
    input_ids=inputs["input_ids"].to(device),
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/modeling_transfo_xl.py", line 866, in forward
    mems = self.init_mems(bsz)
  File "/opt/conda/lib/python3.6/site-packages/transformers/modeling_transfo_xl.py", line 800, in init_mems
    param = next(self.parameters())
StopIteration
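What the traceback points at: TransfoXLModel.init_mems calls next(self.parameters()) to find the model's device. Under DataParallel on recent PyTorch (roughly 1.5 and later), each replica exposes its tensors as plain attributes rather than registered parameters, so self.parameters() yields nothing inside the worker and next() raises StopIteration. A minimal sketch of that failure mode with a toy module (the Probe class is made up for illustration; reproducing it needs at least two visible GPUs, otherwise DataParallel skips replication):

import torch
import torch.nn as nn

class Probe(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(4))

    def forward(self, x):
        # Same pattern as TransfoXL's init_mems: grab any parameter to read
        # its device. In a DataParallel replica on PyTorch >= 1.5 this
        # iterator is empty, so next() raises StopIteration in the worker.
        param = next(self.parameters())
        return x.to(param.device) + self.weight

model = nn.DataParallel(Probe().to("cuda:0"))
model(torch.randn(2, 4).to("cuda:0"))  # -> StopIteration: Caught StopIteration in replica 0 on device 0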
To reproduce the problem
Run the code below:
import torch
from torch.nn import DataParallel
from transformers import TransfoXLTokenizer, TransfoXLModel
device = "cuda:0"
# Get model
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103', return_dict=True)
model = DataParallel(model, device_ids=list(range(torch.cuda.device_count())))
model.to(device=device)
# Run forward
inputs = tokenizer(["This is an example"], return_tensors="pt")
outputs = model(
    input_ids=inputs["input_ids"].to(device),
)
print(f"outputs: {outputs}")
print("Success.")
Issue Analytics
- Created: 3 years ago
- Comments: 14 (7 by maintainers)
Top Results From Across the Web
- Pytorch - Caught StopIteration in replica 1 on device 1 error ...: While running on the server, with 4 GPUs enabled, below is the error I get: StopIteration: Caught StopIteration in replica 1 on device...
- Caught StopIteration in replica 0 on device 0 - PyTorch Forums: I don't think the parameters() method changed in the last year(s) so wouldn't think the error is caused by an update in the...
- Caught StopIteration in replica 0 on device 0: troubleshooting and resolution: After searching online, the problem may be caused by parts of the data having different precision during training (16-bit and 32-bit tensors coexisting at the same time); the suggestion is to modify the code here and pin the dtype directly to torch.
- Source code for transformers.modeling_utils - Hugging Face: for module in self.modules(): module.mem_rss_diff = 0 ... try: return next(self.parameters()).device except StopIteration: # For nn.
- [transformer-xl] model notes - AliceYing - 博客园: Error: StopIteration: Caught StopIteration in replica 0 on device 0. Fix: downgrade the PyTorch version; 1.3.1 was tested to train normally, and 1.4.0 should also work.
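The modeling_utils snippet above hints at the defensive pattern used elsewhere in transformers for this situation: try next(self.parameters()) first, and fall back when the iterator is empty instead of letting StopIteration escape. A rough sketch of that idea; module_device and its fallback are made-up illustrations, not the library's actual code:

import torch
import torch.nn as nn

def module_device(module: nn.Module) -> torch.device:
    # Preferred path: ask the first registered parameter for its device.
    try:
        return next(module.parameters()).device
    except StopIteration:
        # Fallback for parameter-less modules (e.g. DataParallel replicas):
        # look for any tensor stored as a plain attribute and use its device.
        for value in vars(module).values():
            if torch.is_tensor(value):
                return value.device
        return torch.device("cpu")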
As I said in my previous post, you can just use single-GPU or distributed training instead. Of course transformer-xl is supported; but we cannot update its code to bypass the PyTorch issues with DataParallel without breaking backwards compatibility with previous checkpoints.

As of now, PyTorch doesn't support calling self.parameters() within DataParallel, which causes the current issue. Even after fixing that, which was straightforward, PyTorch also doesn't support calling self.ParameterList and self.ParameterDict, which are also used in TransfoXL and will cause another issue. As PyTorch is moving people away from DataParallel, they are unlikely to fix this anytime soon on their end. On our end, this is going to be much harder to fix without breaking backward compatibility, as changing the way the model is organized means previous checkpoints could no longer be loaded. In the meantime, you could use DistributedDataParallel instead.
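A minimal sketch of that DistributedDataParallel route, assuming one process per GPU started by a launcher that sets the LOCAL_RANK environment variable (e.g. python -m torch.distributed.launch --use_env on this PyTorch 1.6 setup, or torchrun on newer releases); the backend choice and the script itself are illustrative, not from the original thread:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import TransfoXLTokenizer, TransfoXLModel

local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103', return_dict=True).cuda(local_rank)
# Each process owns a full copy of the module, so next(self.parameters())
# inside TransfoXL sees real parameters; DDP only synchronizes gradients
# across processes rather than replicating the module per forward pass.
model = DDP(model, device_ids=[local_rank])

inputs = tokenizer(["This is an example"], return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"].cuda(local_rank))
print(f"rank {local_rank}: {outputs.last_hidden_state.shape}")

For pure inference the DDP wrapper is not strictly needed; each process can simply run the unwrapped model on its own GPU.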