Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Getting error while training LayoutLMv2 model on multi-GPU setup

See original GitHub issue

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 0: layoutlmv2.pooler.dense.bias, layoutlmv2.pooler.dense.weight, layoutlmv2.visual.backbone.fpn_output5.bias, layoutlmv2.visual.backbone.fpn_output5.weight, layoutlmv2.visual.backbone.fpn_output4.bias, layoutlmv2.visual.backbone.fpn_output4.weight, layoutlmv2.visual.backbone.fpn_output3.bias, layoutlmv2.visual.backbone.fpn_output3.weight
Parameter indices which did not receive grad for rank 0: 16 17 20 21 24 25 510 511

Traceback (most recent call last):
  File "LayoutLMv2 Best Model - 0.7873 - Finetuning on LS Internal All Data.py", line 788, in <module>
    outputs = model(**train_batch)
  File "/root/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/py38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 1: layoutlmv2.pooler.dense.bias, layoutlmv2.pooler.dense.weight, layoutlmv2.visual.backbone.fpn_output5.bias, layoutlmv2.visual.backbone.fpn_output5.weight, layoutlmv2.visual.backbone.fpn_output4.bias, layoutlmv2.visual.backbone.fpn_output4.weight, layoutlmv2.visual.backbone.fpn_output3.bias, layoutlmv2.visual.backbone.fpn_output3.weight
Parameter indices which did not receive grad for rank 1: 16 17 20 21 24 25 510 511

I am following the guide here: https://huggingface.co/docs/transformers/accelerate

Can anyone help??
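
For reference, a minimal multi-GPU fine-tuning loop in the style of the linked accelerate guide looks roughly like the sketch below. The model class, checkpoint, hyperparameters, and train_dataset here are illustrative stand-ins rather than code from the issue; the RuntimeError above is raised at the model(**train_batch) call on the second iteration, once DDP notices parameters that never received gradients.

from torch.optim import AdamW
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import LayoutLMv2ForTokenClassification

# train_dataset (already-encoded LayoutLMv2 features, including labels) is assumed
# to exist; building it is omitted in this sketch.
model = LayoutLMv2ForTokenClassification.from_pretrained("microsoft/layoutlmv2-base-uncased")
optimizer = AdamW(model.parameters(), lr=5e-5)
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for train_batch in train_dataloader:
    outputs = model(**train_batch)   # the RuntimeError above is raised here on the second step
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()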

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5

Top GitHub Comments

1 reaction
sujit420 commented, Aug 23, 2022

Thanks, @muellerzr for the prompt response. It was constructive.

1 reaction
muellerzr commented, Aug 22, 2022

@sujit420 As the error states, it wants you to pass in find_unused_parameters=True. These come from the DDP kwargs, and you can pass them in when building your Accelerator object: https://huggingface.co/docs/accelerate/package_reference/kwargs#accelerate.DistributedDataParallelKwargs

Can you try the following when declaring your Accelerator?

from accelerate import Accelerator, DistributedDataParallelKwargs

accelerator = Accelerator(kwargs_handlers=[DistributedDataParallelKwargs(find_unused_parameters=True)])
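
That kwargs handler is how accelerate forwards options to torch's DDP wrapper when accelerator.prepare() wraps the model. As a rough sketch (assuming the process group has already been initialised, e.g. by torchrun, and LOCAL_RANK is set by the launcher; model is a placeholder for the LayoutLMv2 model built earlier), the plain-PyTorch equivalent of that flag is:

import os
import torch
from torch.nn.parallel import DistributedDataParallel

# Assumes torch.distributed is already initialised and `model` exists from earlier setup.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = model.to(local_rank)
model = DistributedDataParallel(
    model,
    device_ids=[local_rank],
    find_unused_parameters=True,   # the flag the error message asks for
)

Keep in mind that find_unused_parameters=True adds per-iteration overhead, so if training speed matters it can be worth investigating why the pooler and FPN output parameters receive no gradients in the first place.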
Read more comments on GitHub >

Top Results From Across the Web

  • LayoutLMv2 model not supporting training on more than 1 GPU when using PyTorch Data Parallel: "LayoutLMv2 model not supporting training on more than 1 GPU when using PyTorch Data Parallel #14110. Open. theMADAIguy opened this issue on ..."
  • LayoutLMV2 - Hugging Face: "In this paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks..."
  • Clara Train FAQ - NVIDIA Documentation Center: "Why am I encountering an Invalid cross-device link error? Why is my training getting stuck and hanging in federated learning?"
  • Distributed training with TensorFlow: "Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute ..."
  • Classification Models - Simple Transformers: "You can specify the number of classes/labels to use it as a multi-class classifier or as a regression model."
