RuntimeError: Expected to mark a variable ready only once.
Checklist
- I have searched related issues but cannot get the expected help.
- The bug has not been fixed in the latest version.
Describe the bug
The code runs fine on a single GPU with the following command:
python tools/train.py configs/bevdet/bevdet-sttiny.py
However, when I switch to distributed training with
./tools/dist_train.sh configs/bevdet/bevdet-sttiny.py 8
the program throws the following error:
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet. 3) Incorrect unused parameter detection. The return value of the `forward` function is inspected by the distributed data parallel wrapper to figure out if any of the module's parameters went unused. For unused parameters, DDP would not expect gradients from then. However, if an unused parameter becomes part of the autograd graph at a later point in time (e.g., in a reentrant backward when using `checkpoint`), the gradient will show up unexpectedly. If all parameters in the model participate in the backward pass, you can disable unused parameter detection by passing the keyword argument `find_unused_parameters=False` to `torch.nn.parallel.DistributedDataParallel`.
I have read the error message and found a similar post, but its suggested solution is to set find_unused_parameters=False. I have manually checked this argument in mmdetection, and it is already set to False by default.
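To make reason 2 in the message concrete, here is a minimal toy model of the "mark ready only once" bookkeeping the error text describes. It is not PyTorch code: `ToyReducer`, `mark_ready`, `finalize`, and `run_backward` are hypothetical names invented for illustration, and the sketch only assumes that each parameter is expected to report its gradient once per iteration.

```python
class ToyReducer:
    """Toy stand-in for per-iteration gradient bookkeeping (NOT the real DDP reducer)."""

    def __init__(self, param_names):
        self.param_names = set(param_names)
        self.ready = set()  # parameters whose gradient has already arrived

    def mark_ready(self, name):
        # A second gradient for the same parameter in one iteration is an error.
        if name in self.ready:
            raise RuntimeError("Expected to mark a variable ready only once.")
        self.ready.add(name)

    def finalize(self):
        # Report parameters that never produced a gradient, then reset.
        missing = self.param_names - self.ready
        self.ready.clear()
        return missing


def run_backward(reducer, grad_events):
    """Replay a sequence of 'gradient arrived' events for one iteration."""
    for name in grad_events:
        reducer.mark_ready(name)
    return reducer.finalize()


reducer = ToyReducer(["w1", "w2"])

# Normal iteration: every parameter reports exactly once.
ok_missing = run_backward(reducer, ["w2", "w1"])

# Hypothetical iteration where "w2" reports twice, e.g. from two reentrant
# backward passes over the same checkpointed block.
try:
    run_backward(reducer, ["w2", "w2", "w1"])
    double_mark_error = None
except RuntimeError as exc:
    double_mark_error = str(exc)
```

In this toy model the second event for `w2` is what triggers the error. Whether that is the actual cause in this issue is not established here.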
Reproduction
- What command or script did you run?
./tools/dist_train.sh configs/bevdet/bevdet-sttiny.py 8
- Did you make any modifications on the code or config? Did you understand what you have modified?
The only modification I made was to delete the Swin Transformer registered by mmdetection before the definition of the self-implemented Swin Transformer in this repo. Specifically, I inserted
del BACKBONES._module_dict['SwinTransformer']
before this line. Otherwise, MMCV throws an error because of the duplicate model definition.
- What dataset did you use?
nuScenes
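The registry modification described above can be sketched with a toy registry. This is a simplified stand-in written for illustration, not MMCV's `Registry` implementation: `ToyRegistry`, its `force` flag, and `register_local` are assumptions, and only the `_module_dict` attribute name and the `del` statement are taken from the issue text.

```python
class ToyRegistry:
    """Simplified stand-in for a name -> class registry (NOT MMCV's Registry)."""

    def __init__(self, name):
        self.name = name
        self._module_dict = {}

    def register_module(self, force=False):
        def decorator(cls):
            key = cls.__name__
            # Refuse to overwrite an existing entry unless explicitly forced.
            if key in self._module_dict and not force:
                raise KeyError(f"{key} is already registered in {self.name}")
            self._module_dict[key] = cls
            return cls
        return decorator


BACKBONES = ToyRegistry("backbone")


@BACKBONES.register_module()
class SwinTransformer:
    origin = "upstream"  # plays the role of the class shipped with mmdetection


def register_local():
    """Register a second class with the same name, as the repo-local backbone does."""
    @BACKBONES.register_module()
    class SwinTransformer:
        origin = "local"
    return SwinTransformer


# Without the workaround, the duplicate name is rejected.
try:
    register_local()
    duplicate_error = None
except KeyError as exc:
    duplicate_error = str(exc)

# The workaround from the issue: drop the upstream entry, then register again.
del BACKBONES._module_dict["SwinTransformer"]
local_cls = register_local()
```

The `force` flag exists only in this toy. Check the MMCV `Registry.register_module` documentation for whether an equivalent override is available in your version before relying on it.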
Environment
- Please run
python mmdet3d/utils/collect_env.py to collect necessary environment information and paste it here.
Python: 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA TITAN RTX
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29190527_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code
=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.0.5
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden
-DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers
-Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overf
low -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligne
d-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512
=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.9.0
OpenCV: 4.6.0
MMCV: 1.3.18
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.17.0
MMSegmentation: 0.18.0
MMDetection3D: 0.17.2+
- You may add additional information that may be helpful for locating the problem, such as
- How you installed PyTorch [e.g., pip, conda, source]
I installed PyTorch within a Docker container using pip.
Error traceback
If applicable, paste the error traceback here.
Traceback (most recent call last):
File "./tools/train.py", line 224, in <module>
main()
File "./tools/train.py", line 213, in main
train_model(
File "/home/users/Code/BEVDet/mmdet3d/apis/train.py", line 28, in train_model
train_detector(
File "/opt/conda/lib/python3.8/site-packages/mmdet/apis/train.py", line 174, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.call_hook('after_train_iter')
File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
getattr(hook, fn_name)(self)
File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 35, in after_train_iter
runner.outputs['loss'].backward()
File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
Variable._execution_engine.run_backward(
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 89, in apply
return self._forward_cls.backward(self, *args) # type: ignore
File "/opt/conda/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 112, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
Variable._execution_engine.run_backward(
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet. 3) Incorrect unused parameter detection. The return value of the `forward` function is inspected by the distributed data parallel wrapper to figure out if any of the module's parameters went unused. For unused parameters, DDP would not expect gradients from then. However, if an unused parameter becomes part of the autograd graph at a later point in time (e.g., in a reentrant backward when using `checkpoint`), the gradient will show up unexpectedly. If all parameters in the model participate in the backward pass, you can disable unused parameter detection by passing the keyword argument `find_unused_parameters=False` to `torch.nn.parallel.DistributedDataParallel`.
Issue Analytics
- State:
- Created a year ago
- Comments:6 (3 by maintainers)

That is not the reason why the program cannot run under the distributed training setting. Please read my issue description carefully. Simply deleting the Swin Transformer registered in mmdet (maybe it is a newer version than the one you used) with
del BACKBONES._module_dict['SwinTransformer']
and registering the Swin Transformer in this codebase with @BACKBONES.register_module() has the same effect as using an older mmdet in which no Swin Transformer is defined.

In fact, I have found that turning off checkpointing (by setting with_cp=False) solves my problem. I am not sure why PyTorch checkpoint cannot be used with distributed training. But for those who are facing the same issue, the simplest way is to turn off checkpointing and reduce the batch size per GPU.

Hello, I have run into this problem as well. I am trying to replace the backbone of BEVStereo, and I get this error even on a single GPU with batch size 1. For mmdet, is it enough to copy the .py file of the replacement backbone into mmdet3d/models/backbone and then import it in __init__.py? I look forward to your reply. Thank you.
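For the with_cp=False workaround mentioned above, a minimal sketch of switching off every checkpoint flag in a nested config dict follows. The helper `disable_checkpointing` and the example keys (`HypotheticalDetector`, `img_backbone`, `img_neck`, `stages`) are assumptions made for illustration; only the `with_cp` key comes from the thread, and the real config layout in this repo may differ.

```python
def disable_checkpointing(cfg):
    """Return a copy of a nested dict/list config with every `with_cp` set to False."""
    if isinstance(cfg, dict):
        return {
            key: (False if key == "with_cp" else disable_checkpointing(value))
            for key, value in cfg.items()
        }
    if isinstance(cfg, (list, tuple)):
        # Preserve the container type while recursing into its items.
        return type(cfg)(disable_checkpointing(item) for item in cfg)
    return cfg


# Hypothetical model config; the key names are illustrative only.
model = dict(
    type="HypotheticalDetector",
    img_backbone=dict(type="SwinTransformer", with_cp=True),
    img_neck=dict(
        type="HypotheticalNeck",
        stages=[dict(with_cp=True), dict(with_cp=False)],
    ),
)

patched = disable_checkpointing(model)
```

The original `model` dict is left unchanged, so the patched copy can be compared against it. As noted in the thread, disabling checkpointing raises memory use, so the per-GPU batch size may need to be reduced.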