
AssertionError: Fatal error: gradients are inconsistent between workers

See original GitHub issue

I got the following error while training a translation task on 16 nodes * 2 GPUs. All of my worker tasks completed successfully, but the master had issues.

Any pointers or help would be appreciated. Thanks.

  File "/root/miniconda3/bin/fairseq-train", line 8, in <module>
Traceback (most recent call last):
  File "/root/miniconda3/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 305, in cli_main
    sys.exit(cli_main())
  File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 305, in cli_main
    distributed_main(args.device_id, args)
  File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 284, in distributed_main
    distributed_main(args.device_id, args)
  File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 284, in distributed_main
    main(args, init_distributed=True)
  File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 80, in main
    main(args, init_distributed=True)
  File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 80, in main
    train(args, trainer, task, epoch_itr)
  File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 121, in train
    train(args, trainer, task, epoch_itr)
  File "/root/miniconda3/lib/python3.7/site-packages/fairseq_cli/train.py", line 121, in train
    log_output = trainer.train_step(samples)
  File "/root/miniconda3/lib/python3.7/site-packages/fairseq/trainer.py", line 316, in train_step
    log_output = trainer.train_step(samples)
  File "/root/miniconda3/lib/python3.7/site-packages/fairseq/trainer.py", line 316, in train_step
    ), 'Fatal error: gradients are inconsistent between workers'
AssertionError: Fatal error: gradients are inconsistent between workers
    ), 'Fatal error: gradients are inconsistent between workers'
AssertionError: Fatal error: gradients are inconsistent between workers```
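
For context, the assertion comes from a check in fairseq's `train_step` that gathers each worker's gradient norm after the gradient all-reduce and aborts if the norms disagree, which usually means some rank skipped an update, ran out of memory, or saw different data. The snippet below is a rough, self-contained sketch of that kind of check, not fairseq's actual code; it assumes an initialized `torch.distributed` process group and that `loss.backward()` has already run on `model`.

```python
import math
import torch
import torch.distributed as dist

def check_grad_norms_consistent(model):
    # Gradient norm on this rank (reshaped to [1] so it can be all_gathered).
    local_norm = torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
    ).reshape(1)

    # Collect every rank's norm.
    gathered = [torch.zeros_like(local_norm) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_norm)
    norms = [t.item() for t in gathered]

    # After the gradient all-reduce, every rank should see the same norm
    # (or all ranks should see a non-finite norm, i.e. a shared overflow).
    ok = all(math.isclose(n, norms[0], rel_tol=1e-6) for n in norms) or all(
        not math.isfinite(n) for n in norms
    )
    assert ok, "Fatal error: gradients are inconsistent between workers"
```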

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

10 reactions
hanbyul-kim commented, Jan 15, 2021

Here’s my latest update:

I found that most of the multi-GPU training problems were related to memory, so I reduced `--max-tokens` and the training completed successfully.
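
Since `--max-tokens` sets the per-GPU batch size in tokens, out-of-memory pressure on a subset of workers is a plausible source of this kind of divergence. As a quick sanity check, you can compare peak memory use against the device total before and after lowering the value; this is a minimal sketch using standard PyTorch CUDA statistics (the function name and call site are illustrative, not part of fairseq):

```python
import torch

def report_gpu_memory(device=0):
    # Peak bytes allocated by tensors on this device (since startup or the
    # last torch.cuda.reset_peak_memory_stats), versus the device's total memory.
    total = torch.cuda.get_device_properties(device).total_memory
    peak = torch.cuda.max_memory_allocated(device)
    print(f"cuda:{device} peak allocated {peak / 2**30:.2f} GiB "
          f"of {total / 2**30:.2f} GiB ({100 * peak / total:.0f}%)")

# Call after a few training updates, e.g.:
# report_gpu_memory(0)
```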

0 reactions
hanbyul-kim commented, Jan 15, 2021

> Solved this issue: use the `--use-bmuf` flag in your training config

I've run into another issue even when I use the `--use-bmuf` flag:

Traceback (most recent call last):
  File "/home/hanbyu1.kim/fairseq/fairseq/fairseq/trainer.py", line 646, in train_step
    raise FloatingPointError("gradients are Nan/Inf")
FloatingPointError: gradients are Nan/Inf

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hanbyu1.kim/fairseq/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/hanbyu1.kim/fairseq/fairseq/fairseq/distributed_utils.py", line 302, in distributed_main
    main(cfg, **kwargs)
  File "/home/hanbyu1.kim/fairseq/fairseq/fairseq_cli/train.py", line 137, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/usr/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/hanbyu1.kim/fairseq/fairseq/fairseq_cli/train.py", line 237, in train
    log_output = trainer.train_step(samples)
  File "/usr/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/hanbyu1.kim/fairseq/fairseq/fairseq/trainer.py", line 667, in train_step
    ignore_grad=False,
  File "/home/hanbyu1.kim/fairseq/fairseq/fairseq/tasks/fairseq_task.py", line 428, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/home/hanbyu1.kim/fairseq/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hanbyu1.kim/fairseq/fairseq/fairseq/criterions/label_smoothed_cross_entropy.py", line 79, in forward
    net_output = model(**sample["net_input"])
  File "/home/hanbyu1.kim/fairseq/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hanbyu1.kim/fairseq/fairseq/fairseq/models/speech_to_text/s2t_transformer.py", line 256, in forward
    encoder_out = self.encoder(src_tokens=src_tokens, src_lengths=src_lengths)
  File "/home/hanbyu1.kim/fairseq/lib/python3.6/site-packages/torch/nn/modules/module.py", line 738, in _call_impl
    var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
StopIteration
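
One way to narrow down a `FloatingPointError` like this is to check, right after the backward pass, which parameters actually carry non-finite gradients. The sketch below is illustrative only (not fairseq's internal check) and assumes access to the training `model`:

```python
import torch

def find_nonfinite_grads(model):
    # Names of parameters whose gradients contain NaN or Inf.
    return [
        name
        for name, p in model.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all()
    ]

# Right after loss.backward() in the training loop:
# bad = find_nonfinite_grads(model)
# if bad:
#     print("non-finite gradients in:", bad)
```

In practice, NaN/Inf gradients during mixed-precision or large-batch training are often addressed by lowering the learning rate, enabling gradient clipping, or using a longer warmup.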
Read more comments on GitHub >

