RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one
Thanks for your error report and we appreciate it a lot.
Checklist
- I have searched related issues but cannot get the expected help.
- The bug has not been fixed in the latest version.
Describe the bug
A clear and concise description of what the bug is.
Reproduction
- What command or script did you run? I changed the config name from faster_rcnn_r50_fpn_1x.py to element.py and ran:
CUDA_VISIBLE_DEVICES=1,2,3 ./tools/dist_train.sh configs/element.py 3 --autoscale-lr
- Did you make any modifications on the code or config? Did you understand what you have modified? Only num_classes and work_dir in the config.
- What dataset did you use? My own dataset, prepared in the same format as VOC.
Environment
- Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
- You may add additional information that may be helpful for locating the problem, such as
  - How you installed PyTorch [e.g., pip, conda, source]
  - Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)
Error traceback
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f92f4501441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f92f4500d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x7f92f4de983c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c52bd (0x7f92f4ddf2bd in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130cfc (0x7f92f484acfc in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: _PyCFunction_FastCallKeywords + 0x1ac (0x4b33ec in /usr/local/bin/python)
frame #6: /usr/local/bin/python() [0x544be8]
frame #7: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #8: /usr/local/bin/python() [0x544a85]
frame #9: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #10: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #11: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #12: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #14: /usr/local/bin/python() [0x544a85]
frame #15: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #16: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #17: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #18: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #19: /usr/local/bin/python() [0x4cf4bf]
frame #20: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #22: /usr/local/bin/python() [0x544a85]
frame #23: PyEval_EvalCodeEx + 0x3e (0x54599e in /usr/local/bin/python)
frame #24: /usr/local/bin/python() [0x489dd6]
frame #25: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #27: /usr/local/bin/python() [0x544a85]
frame #28: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #29: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #30: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #31: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #33: /usr/local/bin/python() [0x544a85]
frame #34: /usr/local/bin/python() [0x544d37]
frame #35: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #36: /usr/local/bin/python() [0x544a85]
frame #37: /usr/local/bin/python() [0x544d37]
frame #38: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #39: /usr/local/bin/python() [0x544a85]
frame #40: /usr/local/bin/python() [0x544d37]
frame #41: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #42: /usr/local/bin/python() [0x5440e1]
frame #43: /usr/local/bin/python() [0x544f91]
frame #44: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #45: /usr/local/bin/python() [0x544a85]
frame #46: PyEval_EvalCode + 0x23 (0x545913 in /usr/local/bin/python)
frame #47: PyRun_FileExFlags + 0x16f (0x42b41f in /usr/local/bin/python)
frame #48: PyRun_SimpleFileExFlags + 0xec (0x42b64c in /usr/local/bin/python)
frame #49: Py_Main + 0xd85 (0x43fa15 in /usr/local/bin/python)
frame #50: main + 0x162 (0x421b62 in /usr/local/bin/python)
frame #51: __libc_start_main + 0xf0 (0x7f92f8173830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #52: _start + 0x29 (0x421c39 in /usr/local/bin/python)
Traceback (most recent call last):
  File "./tools/train.py", line 142, in <module>
    main()
  File "./tools/train.py", line 138, in main
    meta=meta)
  File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 102, in train_detector
    meta=meta)
  File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 171, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/runner.py", line 371, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/runner.py", line 275, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 75, in batch_processor
    losses = model(**data)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 392, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fcaf0f72441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fcaf0f71d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x7fcaf185a83c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c52bd (0x7fcaf18502bd in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130cfc (0x7fcaf12bbcfc in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: _PyCFunction_FastCallKeywords + 0x1ac (0x4b33ec in /usr/local/bin/python)
frame #6: /usr/local/bin/python() [0x544be8]
frame #7: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #8: /usr/local/bin/python() [0x544a85]
frame #9: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #10: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #11: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #12: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #14: /usr/local/bin/python() [0x544a85]
frame #15: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #16: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #17: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #18: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #19: /usr/local/bin/python() [0x4cf4bf]
frame #20: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #22: /usr/local/bin/python() [0x544a85]
frame #23: PyEval_EvalCodeEx + 0x3e (0x54599e in /usr/local/bin/python)
frame #24: /usr/local/bin/python() [0x489dd6]
frame #25: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #27: /usr/local/bin/python() [0x544a85]
frame #28: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #29: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #30: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #31: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #33: /usr/local/bin/python() [0x544a85]
frame #34: /usr/local/bin/python() [0x544d37]
frame #35: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #36: /usr/local/bin/python() [0x544a85]
frame #37: /usr/local/bin/python() [0x544d37]
frame #38: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #39: /usr/local/bin/python() [0x544a85]
frame #40: /usr/local/bin/python() [0x544d37]
frame #41: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #42: /usr/local/bin/python() [0x5440e1]
frame #43: /usr/local/bin/python() [0x544f91]
frame #44: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #45: /usr/local/bin/python() [0x544a85]
frame #46: PyEval_EvalCode + 0x23 (0x545913 in /usr/local/bin/python)
frame #47: PyRun_FileExFlags + 0x16f (0x42b41f in /usr/local/bin/python)
frame #48: PyRun_SimpleFileExFlags + 0xec (0x42b64c in /usr/local/bin/python)
frame #49: Py_Main + 0xd85 (0x43fa15 in /usr/local/bin/python)
frame #50: main + 0x162 (0x421b62 in /usr/local/bin/python)
frame #51: __libc_start_main + 0xf0 (0x7fcaf4be4830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #52: _start + 0x29 (0x421c39 in /usr/local/bin/python)
^CTraceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 228, in main
    process.wait()
  File "/usr/lib/python3.6/subprocess.py", line 1457, in wait
    (pid, sts) = self._try_wait(0)
  File "/usr/lib/python3.6/subprocess.py", line 1404, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
root@83403c5335c7:mmdetection_v2# ^C
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!
Issue Analytics
- Created 4 years ago
- Comments: 33 (3 by maintainers)
Top GitHub Comments
This was helpful. I encountered the same error message in a custom architecture. Here is how you can solve it without changing the module: if you define 5 layers but only use the output of the 4th layer to calculate a specific loss, you can multiply the output of the 5th layer by zero and add it to the loss. This tricks PyTorch into believing that all parameters contribute to the loss. Problem solved. Deleting the 5th layer is not an option in my case, because I need the output of this layer in most training steps (but not all).
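A minimal sketch of that workaround, for anyone who wants to see it spelled out. The model below is hypothetical: `FiveLayerNet`, the `use_layer5` flag and the layer sizes are illustrative names I made up, not code from this issue or from mmdetection. It runs on CPU without DistributedDataParallel and only checks that every parameter receives a gradient, which is the condition the DDP reducer is complaining about.

```python
import torch
import torch.nn as nn


class FiveLayerNet(nn.Module):
    """Hypothetical model: layer5 is skipped by the loss on some steps."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
        self.layer5 = nn.Linear(8, 8)

    def forward(self, x, use_layer5):
        for layer in self.layers:
            x = torch.relu(layer(x))
        out4 = x
        out5 = self.layer5(out4)  # always computed so it stays in the graph
        if use_layer5:
            return out5.pow(2).mean()
        # The loss depends only on out4; the zero-weighted term adds nothing
        # to its value but gives layer5's parameters an all-zero gradient.
        return out4.pow(2).mean() + 0.0 * out5.sum()


torch.manual_seed(0)
model = FiveLayerNet()
loss = model(torch.randn(2, 8), use_layer5=False)
loss.backward()

# Names of parameters that took no part in the backward pass.
missing = [name for name, p in model.named_parameters() if p.grad is None]
print(missing)
```

The trade-off is that the 5th layer's forward pass is still computed on every step, even when its output is not needed.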
@SystemErrorWang I am also facing the same problem. When I set find_unused_parameters = cfg.get('find_unused_parameters', True), the error disappeared, but my training process got stuck.
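For reference, here is a small sketch of where the `find_unused_parameters=True` argument named in the error message goes when wrapping a model by hand. Everything except the PyTorch calls is an assumption for illustration: `TwoBranch` is a made-up model with one branch that never contributes to the output, and the single-process `gloo` group on CPU with a temporary file store stands in for a real multi-GPU launch. It does not reproduce or explain the hang reported above; it only shows the flag in use.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class TwoBranch(nn.Module):
    """Hypothetical model: `unused` never contributes to the output."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 1)
        self.unused = nn.Linear(4, 1)

    def forward(self, x):
        return self.used(x)


# Single-process CPU group, for illustration only (assumes a POSIX path).
store_file = os.path.join(tempfile.mkdtemp(), "ddp_store")
dist.init_process_group(
    backend="gloo",
    init_method=f"file://{store_file}",
    rank=0,
    world_size=1,
)

# The flag asks DDP to detect parameters that did not take part in forward.
model = DDP(TwoBranch(), find_unused_parameters=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

steps_done = 0
for _ in range(2):  # two iterations, since the error fires on a later step
    optimizer.zero_grad()
    loss = model(torch.randn(3, 4)).sum()
    loss.backward()
    optimizer.step()
    steps_done += 1

dist.destroy_process_group()
print(steps_done)
```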