deadlock using Wandb

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug

Hello mmdet developers,

We found that the training loop can deadlock in some places when multi-GPU training is used with wandb tracking enabled. Single-GPU training works perfectly fine. I only tested with YOLOX. Please see the command below.

Reproduction

  1. What command or script did you run?
     ./tools/dist_train.sh ./configs/yolox/yolox_s_8x8_300e_coco.py 2
  2. Did you make any modifications on the code or config? Did you understand what you have modified? No
  3. What dataset did you use? MSCOCO
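
The report does not show how wandb tracking was switched on. As a rough sketch only, assuming the MMDetection 2.x config style and the MMDetWandbHook mentioned later in this thread, wandb logging is usually enabled by adding a hook to log_config. The project name and interval values below are placeholders, not values taken from the report.

```python
# Hypothetical config fragment (assumption: MMDetection 2.x style config).
# The reporter's actual logging settings are not shown in the issue.
log_config = dict(
    interval=50,  # placeholder logging interval
    hooks=[
        dict(type='TextLoggerHook'),
        dict(
            type='MMDetWandbHook',
            init_kwargs={'project': 'mmdetection'},  # placeholder project name
            interval=50,
        ),
    ],
)
```

Removing the wandb hook from the list and rerunning the same command is one way to check whether the hang is tied to wandb logging.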

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.

sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
GPU 0,1: Quadro GV100
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.3.r11.3/compiler.29745058_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:
  • GCC 7.3
  • C++ Version: 201402
  • Intel® oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel® 64 architecture applications
  • Intel® MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.2
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.0
OpenCV: 4.5.5
MMCV: 1.4.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.25.0+ca11860

  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

We used the provided Docker setup.

Error traceback

If applicable, paste the error traceback here.

A placeholder for traceback.

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 18 (3 by maintainers)

Top GitHub Comments

2 reactions
MilkClouds commented, Aug 9, 2022

I experience the same phenomenon (a deadlock lasting over 30 minutes) with dyhead/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x_coco.py, but only in the distributed training setting (more than 1 GPU). Training on 1 GPU is okay.

0 reactions
Fizzez commented, Dec 11, 2022

Hi @MilkClouds, thank you for the analysis. Glad to see that we have the same opinion on this. Your solution actually makes more sense, since it lets MMDetWandbHook's runner.log_buffer.clear_output() get called. Also, thank you for mentioning me in your PR. I really appreciate it.
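
The root cause is not spelled out in the comments shown here. One general pattern that hangs only when more than one process is involved is a synchronization point that some ranks never reach. The following is a toy illustration with threads standing in for ranks. The names run_ranks and skip_rank are hypothetical, and this is not the confirmed mechanism of the MMDetection issue.

```python
import threading


def run_ranks(world_size, skip_rank=None, timeout=0.5):
    """Simulate ranks meeting at a collective call, modeled as a barrier.

    If skip_rank is set, that rank returns without reaching the barrier,
    the way a rank-specific code path might skip a collective call.
    A timeout is used so the demo ends instead of hanging forever.
    """
    barrier = threading.Barrier(world_size)
    results = {}

    def worker(rank):
        if rank == skip_rank:
            results[rank] = "skipped"
            return
        try:
            barrier.wait(timeout=timeout)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # In real distributed training there is often no timeout,
            # so this rank would simply wait indefinitely.
            results[rank] = "stuck"

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling run_ranks(2) lets both simulated ranks pass the barrier, while run_ranks(2, skip_rank=0) leaves rank 1 waiting until the timeout. With a single rank there is nobody to wait for, which matches the observation that 1 GPU training is unaffected.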
