
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one

See original GitHub issue

Thanks for your error report; we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

A clear and concise description of what the bug is.

Reproduction

  1. What command or script did you run? I have changed the config name from faster_rcnn_r50_fpn_1x.py to element.py:
CUDA_VISIBLE_DEVICES=1,2,3 ./tools/dist_train.sh configs/element.py 3 --autoscale-lr
  2. Did you make any modifications on the code or config? Did you understand what you have modified? Only num_classes and work_dir in the config.

  3. What dataset did you use? My own dataset, which is formatted the same as VOC.

Environment

(screenshot attached in the original issue)

  1. Please run python mmdet/utils/collect_env.py to collect the necessary environment information and paste it here.

  2. You may add additional information that may be helpful for locating the problem, such as:

    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f92f4501441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f92f4500d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x7f92f4de983c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c52bd (0x7f92f4ddf2bd in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130cfc (0x7f92f484acfc in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: _PyCFunction_FastCallKeywords + 0x1ac (0x4b33ec in /usr/local/bin/python)
frame #6: /usr/local/bin/python() [0x544be8]
frame #7: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #8: /usr/local/bin/python() [0x544a85]
frame #9: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #10: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #11: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #12: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #14: /usr/local/bin/python() [0x544a85]
frame #15: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #16: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #17: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #18: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #19: /usr/local/bin/python() [0x4cf4bf]
frame #20: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #22: /usr/local/bin/python() [0x544a85]
frame #23: PyEval_EvalCodeEx + 0x3e (0x54599e in /usr/local/bin/python)
frame #24: /usr/local/bin/python() [0x489dd6]
frame #25: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #27: /usr/local/bin/python() [0x544a85]
frame #28: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #29: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #30: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #31: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #33: /usr/local/bin/python() [0x544a85]
frame #34: /usr/local/bin/python() [0x544d37]
frame #35: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #36: /usr/local/bin/python() [0x544a85]
frame #37: /usr/local/bin/python() [0x544d37]
frame #38: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #39: /usr/local/bin/python() [0x544a85]
frame #40: /usr/local/bin/python() [0x544d37]
frame #41: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #42: /usr/local/bin/python() [0x5440e1]
frame #43: /usr/local/bin/python() [0x544f91]
frame #44: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #45: /usr/local/bin/python() [0x544a85]
frame #46: PyEval_EvalCode + 0x23 (0x545913 in /usr/local/bin/python)
frame #47: PyRun_FileExFlags + 0x16f (0x42b41f in /usr/local/bin/python)
frame #48: PyRun_SimpleFileExFlags + 0xec (0x42b64c in /usr/local/bin/python)
frame #49: Py_Main + 0xd85 (0x43fa15 in /usr/local/bin/python)
frame #50: main + 0x162 (0x421b62 in /usr/local/bin/python)
frame #51: __libc_start_main + 0xf0 (0x7f92f8173830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #52: _start + 0x29 (0x421c39 in /usr/local/bin/python)

Traceback (most recent call last):
  File "./tools/train.py", line 142, in <module>
    main()
  File "./tools/train.py", line 138, in main
    meta=meta)
  File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 102, in train_detector
    meta=meta)
  File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 171, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/runner.py", line 371, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/runner.py", line 275, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 75, in batch_processor
    losses = model(**data)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 392, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fcaf0f72441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fcaf0f71d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x7fcaf185a83c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c52bd (0x7fcaf18502bd in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130cfc (0x7fcaf12bbcfc in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: _PyCFunction_FastCallKeywords + 0x1ac (0x4b33ec in /usr/local/bin/python)
frame #6: /usr/local/bin/python() [0x544be8]
frame #7: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #8: /usr/local/bin/python() [0x544a85]
frame #9: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #10: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #11: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #12: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #14: /usr/local/bin/python() [0x544a85]
frame #15: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #16: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #17: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #18: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #19: /usr/local/bin/python() [0x4cf4bf]
frame #20: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #22: /usr/local/bin/python() [0x544a85]
frame #23: PyEval_EvalCodeEx + 0x3e (0x54599e in /usr/local/bin/python)
frame #24: /usr/local/bin/python() [0x489dd6]
frame #25: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #27: /usr/local/bin/python() [0x544a85]
frame #28: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #29: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #30: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #31: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #33: /usr/local/bin/python() [0x544a85]
frame #34: /usr/local/bin/python() [0x544d37]
frame #35: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #36: /usr/local/bin/python() [0x544a85]
frame #37: /usr/local/bin/python() [0x544d37]
frame #38: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #39: /usr/local/bin/python() [0x544a85]
frame #40: /usr/local/bin/python() [0x544d37]
frame #41: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #42: /usr/local/bin/python() [0x5440e1]
frame #43: /usr/local/bin/python() [0x544f91]
frame #44: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #45: /usr/local/bin/python() [0x544a85]
frame #46: PyEval_EvalCode + 0x23 (0x545913 in /usr/local/bin/python)
frame #47: PyRun_FileExFlags + 0x16f (0x42b41f in /usr/local/bin/python)
frame #48: PyRun_SimpleFileExFlags + 0xec (0x42b64c in /usr/local/bin/python)
frame #49: Py_Main + 0xd85 (0x43fa15 in /usr/local/bin/python)
frame #50: main + 0x162 (0x421b62 in /usr/local/bin/python)
frame #51: __libc_start_main + 0xf0 (0x7fcaf4be4830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #52: _start + 0x29 (0x421c39 in /usr/local/bin/python)

^CTraceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 228, in main
    process.wait()
  File "/usr/lib/python3.6/subprocess.py", line 1457, in wait
    (pid, sts) = self._try_wait(0)
  File "/usr/lib/python3.6/subprocess.py", line 1404, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
root@83403c5335c7:mmdetection_v2# ^C

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 33 (3 by maintainers)

Top GitHub Comments

8 reactions
edgarschnfld commented, Oct 23, 2020

I met the same issue, but I solved it. The reason is that in my model class I define an FPN module with 5 levels of output feature maps in the __init__ function, but in the forward function I only use 4 of them. When I use all of them, the problem is solved. My conclusion: you should use all outputs of each module in the forward function.
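
As a concrete illustration of that explanation, here is a minimal, hypothetical sketch (not the reporter's actual model) of the failure pattern: a submodule is constructed in __init__ but its output never reaches the value returned by forward, so under DistributedDataParallel its parameters never receive gradients and the reducer raises this RuntimeError.

# Minimal sketch of the pattern described above (hypothetical module,
# not the reporter's actual model): head_b is constructed in __init__ but its
# output never reaches the return value of forward(), so under
# DistributedDataParallel its parameters never receive gradients and the
# reducer raises the error on the next iteration.
import torch
import torch.nn as nn

class ToyDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)
        self.head_a = nn.Linear(16, 4)
        self.head_b = nn.Linear(16, 4)  # defined but never used in forward()

    def forward(self, x):
        feat = torch.relu(self.backbone(x))
        return self.head_a(feat)  # head_b's parameters stay out of the graph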

This was helpful. I encountered the same error message in a custom architecture. Here is how to solve it without changing the module: if you define 5 layers but only use the output of the 4th layer to calculate a specific loss, you can solve the problem by multiplying the output of the 5th layer by zero and adding it to the loss. This way you trick PyTorch into believing that all parameters contribute to the loss. Problem solved. Deleting the 5th layer is not an option in my case, because I need the output of this layer in most training steps (but not all).

loss = your_loss_function(output_layer_4) + 0 * output_layer_5.mean()
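
A slightly expanded sketch of that one-liner, for context; output_layer_4, output_layer_5 and your_loss_function are the commenter's placeholder names, not a real API.

# Sketch of the zero-weight trick from the comment above; the names are
# placeholders, not a real API.
def compute_loss(output_layer_4, output_layer_5, your_loss_function):
    loss = your_loss_function(output_layer_4)
    # The extra term is numerically zero but still touches output_layer_5,
    # so every parameter that produced it enters the autograd graph and the
    # DDP reducer sees a (zero) gradient for it.
    return loss + 0.0 * output_layer_5.mean()
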
7 reactions
mdv3101 commented, May 20, 2020

@SystemErrorWang I am also facing the same problem. When I set find_unused_parameters = cfg.get('find_unused_parameters', True), the error disappeared, but my training process got stuck.
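
For reference, a plain-PyTorch sketch of where that flag goes; this is not mmdetection's exact wrapping code, and it assumes the process group has already been initialized.

# Plain-PyTorch sketch of passing the flag; not mmdetection's exact wrapper.
# Assumes torch.distributed.init_process_group(...) has already been called.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = ToyDetector().cuda()  # ToyDetector from the sketch further up
model = DDP(
    model,
    device_ids=[torch.cuda.current_device()],
    find_unused_parameters=True,  # tolerate parameters unused in forward()
)

Note that enabling the flag adds some per-iteration overhead, and, as the comment above reports, it did not fully resolve the hang in that setup.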
