
RuntimeError: Expected to mark a variable ready only once.


Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

The code runs fine with only a single GPU, e.g. with the following command:

python tools/train.py configs/bevdet/bevdet-sttiny.py

However, when I switch to distributed training:

./tools/dist_train.sh configs/bevdet/bevdet-sttiny.py 8

the program throws the following error:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet. 3) Incorrect unused parameter detection. The return value of the `forward` function is inspected by the distributed data parallel wrapper to figure out if any of the module's parameters went unused. For unused parameters, DDP would not expect gradients from then. However, if an unused parameter becomes part of the autograd graph at a later point in time (e.g., in a reentrant backward when using `checkpoint`), the gradient will show up unexpectedly. If all parameters in the model participate in the backward pass, you can disable unused parameter detection by passing the keyword argument `find_unused_parameters=False` to `torch.nn.parallel.DistributedDataParallel`.

I've read the error message and found similar posts, but their suggested solution is to switch to find_unused_parameters=False. However, I have manually checked this argument in mmdetection, and it is already set to False by default.
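
For context, this is roughly how mmdetection 2.x reads that flag when it wraps the detector for distributed training. The snippet below is a paraphrase of mmdet/apis/train.py (exact lines may differ between versions); cfg is the loaded config and model is the detector built earlier in the training entry point:

    import torch
    from mmcv.parallel import MMDistributedDataParallel

    # The flag defaults to False unless it is set explicitly in the config file.
    find_unused_parameters = cfg.get('find_unused_parameters', False)
    model = MMDistributedDataParallel(
        model.cuda(),
        device_ids=[torch.cuda.current_device()],
        broadcast_buffers=False,
        find_unused_parameters=find_unused_parameters)

In other words, the flag only changes when find_unused_parameters = True is set at the top level of the config, which matches the observation above that it is False by default.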

Reproduction

  1. What command or script did you run?
./tools/dist_train.sh configs/bevdet/bevdet-sttiny.py 8
  2. Did you make any modifications on the code or config? Did you understand what you have modified? The only modification I've made is deleting the registered Swin Transformer in mmdetection before the definition of the self-implemented Swin Transformer in this repo. Specifically, I've inserted del BACKBONES._module_dict['SwinTransformer'] before this line (see the sketch after this list); otherwise, MMCV will throw an error because of a duplicate model definition.

  3. What dataset did you use? nuScenes
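
For reference, a minimal sketch of the modification described in item 2 above, assuming the MMDetection 2.x registry import path (the exact file and the line the statement is inserted before live in the BEVDet repo and are not reproduced here):

    # Drop the SwinTransformer that mmdet has already registered, so the
    # self-implemented SwinTransformer in this repo can reuse the same name
    # without MMCV raising a duplicate-registration error.
    from mmdet.models.builder import BACKBONES

    if 'SwinTransformer' in BACKBONES._module_dict:
        del BACKBONES._module_dict['SwinTransformer']

    # The repo's own definition then registers itself as usual:
    # @BACKBONES.register_module()
    # class SwinTransformer(...):
    #     ...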

Environment

  1. Please run python mmdet3d/utils/collect_env.py to collect necessary environment information and paste it here.
Python: 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA TITAN RTX
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29190527_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.9.0
OpenCV: 4.6.0
MMCV: 1.3.18
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.17.0
MMSegmentation: 0.18.0
MMDetection3D: 0.17.2+
  2. You may add additional information that may be helpful for locating the problem, such as:
    • How you installed PyTorch [e.g., pip, conda, source]: I installed PyTorch within a Docker container using pip.

Error traceback

If applicable, paste the error traceback here.

Traceback (most recent call last):
  File "./tools/train.py", line 224, in <module>
    main()
  File "./tools/train.py", line 213, in main
    train_model(
  File "/home/users/Code/BEVDet/mmdet3d/apis/train.py", line 28, in train_model
    train_detector(
  File "/opt/conda/lib/python3.8/site-packages/mmdet/apis/train.py", line 174, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 35, in after_train_iter
    runner.outputs['loss'].backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 112, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.3) Incorrect unused parameter detection. The return value of the `forward` function is inspected by the distributed data parallel wrapper to figure out if any of the module's parameters went unused. For unused parameters, DDP would not expect gradients from then. However, if an unused parameter becomes part of the autograd graph at a later point in time (e.g., in a reentrant backward when using `checkpoint`), the gradient will show up unexpectedly. If all parameters in the model participate in the backward pass, you can disable unused parameter detection by passing the keyword argument `find_unused_parameters=False` to `torch.nn.parallel.DistributedDataParallel`.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
zeyuwang615 commented, Jul 4, 2022

That is not the reason why the program cannot run under the distributed training setting. Please read my issue description carefully. Simply deleting the registered Swin Transformer in mmdet (maybe it is a newer version than the one you used) with del BACKBONES._module_dict['SwinTransformer'] and registering the Swin Transformer in this codebase with @BACKBONES.register_module() will have the same effect as using an older mmdet where there is no Swin Transformer defined.

In fact, I have found that turning off the checkpoint (by setting with_cp=False) solves my problem. I am not sure why PyTorch checkpoint cannot be used with distributed training, but for those who are facing the same issue, the simplest way is to turn off the checkpoint and cut down the batch size for each GPU.
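
For anyone applying this workaround through the config rather than the code, the flag sits on the backbone definition. A hedged sketch follows; the img_backbone key name and the omitted settings are assumptions about the bevdet-sttiny config, not copied from it:

    # configs/bevdet/bevdet-sttiny.py (sketch; all other keys left unchanged)
    model = dict(
        img_backbone=dict(
            type='SwinTransformer',
            with_cp=False,  # disable gradient checkpointing to avoid the DDP error
            # ... remaining backbone settings ...
        ),
        # ... neck, heads, train/test cfg, etc. ...
    )

Turning checkpointing off increases activation memory, hence the suggestion above to also cut down the per-GPU batch size.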

0 reactions
yukaizhou commented, Oct 27, 2022

> That is not the reason why the program cannot run under the distributed training setting. Please read my issue description carefully. Simply deleting the registered Swin Transformer in mmdet (maybe it is a newer version than the one you used) with del BACKBONES._module_dict['SwinTransformer'] and registering the Swin Transformer in this codebase with @BACKBONES.register_module() will have the same effect as using an older mmdet where there is no Swin Transformer defined.
>
> In fact, I have found that turning off the checkpoint (by setting with_cp=False) solves my problem. I am not sure why PyTorch checkpoint cannot be used with distributed training, but for those who are facing the same issue, the simplest way is to turn off the checkpoint and cut down the batch size for each GPU.

Hello, I have also run into this problem. I want to replace the backbone of BEVStereo.

I get this error even with a single GPU and batch size 1. For mmdet, is it enough to copy the .py file of the replacement backbone into mmdet3d/models/backbone and then import it in __init__.py? Looking forward to your reply, thanks.
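
For completeness, the registration flow that question refers to looks roughly like the sketch below; MyBackbone and the file name are hypothetical placeholders, not names from the repo:

    # mmdet3d/models/backbones/my_backbone.py (hypothetical file)
    import torch.nn as nn
    from mmdet.models.builder import BACKBONES  # mmdet3d 0.17.x reuses mmdet's registries

    @BACKBONES.register_module()
    class MyBackbone(nn.Module):
        def __init__(self, out_channels=256):
            super().__init__()
            self.stem = nn.Conv2d(3, out_channels, kernel_size=3, padding=1)

        def forward(self, x):
            # mmdet-style backbones return a tuple/list of feature maps
            return (self.stem(x),)

    # mmdet3d/models/backbones/__init__.py then needs something like:
    # from .my_backbone import MyBackbone
    # __all__ = [..., 'MyBackbone']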
