AssertionError: Fatal error: gradients are inconsistent between workers
See original GitHub issue
I got the following error training a translation task on 16 nodes * 2 GPUs. All my worker tasks completed successfully, but the master had issues.
Any pointers or help are appreciated. Thanks.
File "/root/miniconda3/bin/fairseq-train", line 8, in <module>
Traceback (most recent call last):
File "/root/miniconda3/bin/fairseq-train", line 8, in <module>
sys.exit(cli_main())
File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 305, in cli_main
sys.exit(cli_main())
File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 305, in cli_main
distributed_main(args.device_id, args)
File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 284, in distributed_main
distributed_main(args.device_id, args)
File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 284, in distributed_main
main(args, init_distributed=True)
File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 80, in main
main(args, init_distributed=True)
File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 80, in main
train(args, trainer, task, epoch_itr)
File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 121, in train
train(args, trainer, task, epoch_itr)
File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 121, in train
log_output = trainer.train_step(samples)
File "/root/miniconda3/lib/python3.7/site-packages/fairseq/trainer.py", line 316, in train_step
log_output = trainer.train_step(samples)
File "/root/miniconda3/lib/python3.7/site-packages/fairseq/trainer.py", line 316, in train_step
), 'Fatal error: gradients are inconsistent between workers'
AssertionError: Fatal error: gradients are inconsistent between workers
), 'Fatal error: gradients are inconsistent between workers'
AssertionError: Fatal error: gradients are inconsistent between workers```
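For context, the assertion is raised by a consistency check in fairseq's `Trainer.train_step`: each rank computes its local gradient norm, the norms are compared across workers, and training aborts if they disagree (for example when some ranks skipped or only partially ran their backward pass). The snippet below is a minimal sketch of that kind of check, not fairseq's actual implementation; `check_grad_norms_consistent` is a hypothetical helper and assumes the default `torch.distributed` process group has already been initialized.

```python
# Minimal sketch of a cross-worker gradient-norm consistency check,
# in the spirit of the assertion above. NOT fairseq's actual code;
# assumes torch.distributed.init_process_group() has been called.
import torch
import torch.distributed as dist


def check_grad_norms_consistent(model: torch.nn.Module) -> None:
    # Total gradient norm on this rank.
    grads = [p.grad.norm() for p in model.parameters() if p.grad is not None]
    local_norm = torch.norm(torch.stack(grads))

    # Collect one norm per rank.
    world_size = dist.get_world_size()
    all_norms = [torch.zeros_like(local_norm) for _ in range(world_size)]
    dist.all_gather(all_norms, local_norm)

    # If the norms differ, the ranks are no longer stepping on the same
    # gradients (e.g. one rank ran out of memory and skipped part of its batch).
    assert all(torch.allclose(n, all_norms[0]) for n in all_norms), \
        'Fatal error: gradients are inconsistent between workers'
```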
Issue Analytics
- Created 4 years ago
- Comments: 8 (3 by maintainers)
Top GitHub Comments
Here’s my latest update:
I found that most of the multi-GPU training problems were related to memory, so I reduced --max-tokens and training completed successfully.
I still hit another issue even when I use the --use-bmuf flag.
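For reference, a hypothetical invocation along the lines of that workaround (the dataset path, architecture, and numbers below are placeholders, not taken from the issue): lowering --max-tokens reduces per-GPU memory pressure, while raising --update-freq (gradient accumulation) keeps the effective batch size roughly the same.

```bash
# Hypothetical values -- tune to your model and GPU memory.
# Smaller --max-tokens lowers per-GPU memory use; a larger --update-freq
# accumulates gradients to keep the effective batch size comparable.
fairseq-train data-bin/my-translation-task \
    --arch transformer \
    --max-tokens 2048 \
    --update-freq 4 \
    --distributed-world-size 32
```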