AssertionError: Default process group is not initialized
Describe the bug
python tools/train.py configs/danet/danet_r50-d8_512x1024_40k_cityscapes.py

I get an error when training the model on custom data: AssertionError: Default process group is not initialized.
The GPU is currently running two object detection networks; could that be the reason? mmdetection can train multiple networks at the same time.
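For reference, the assertion itself is raised by torch.distributed whenever the default process group is queried before torch.distributed.init_process_group has been called, which is what SyncBN layers end up doing under a plain (non-distributed) launch. A minimal sketch, assuming PyTorch 1.5 as in the environment below (newer PyTorch releases raise a RuntimeError with similar wording instead):

# Sketch only: reproduces the same assertion outside of mmsegmentation.
import torch.distributed as dist

print(dist.is_initialized())   # False under a plain `python tools/train.py` launch
dist.get_world_size()          # AssertionError: Default process group is not initialized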
Environment info
sys.platform: linux
Python: 3.7.7 (default, Mar 23 2020, 22:36:06) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GPU 0: Tesla V100-PCIE-32GB
GCC: gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0
PyTorch: 1.5.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel® Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel® 64 architecture applications
- Intel® MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.6.0a0+82fd1c8
OpenCV: 4.2.0
MMCV: 1.0.2
MMSegmentation: 0.5.0+b72a6d0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
Hi @HaoweiGis, if you would like to debug with non-distributed training, you need to change SyncBN to BN, since PyTorch SyncBN requires distributed training.

Hi @PriyankaJain-1998, the change is made in each configs/_base_/models/xxx.py (see the config sketch after the command below). You can also run tools/dist_train.sh with a single GPU by setting GPUS=1, like
./tools/dist_train.sh config.py 1
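For reference, a minimal sketch of the SyncBN to BN change described above, assuming the usual layout of the MMSegmentation base model configs (only the norm_cfg line is shown; the rest of the file stays unchanged):

# configs/_base_/models/xxx.py
# Before: requires a distributed launch such as tools/dist_train.sh
# norm_cfg = dict(type='SyncBN', requires_grad=True)
# After: works with the non-distributed tools/train.py
norm_cfg = dict(type='BN', requires_grad=True)
# norm_cfg is passed on to the backbone / decode_head / auxiliary_head
# dicts further down in the same file.

With the single-GPU dist_train.sh command above, the process group is initialized for one process, so SyncBN can be left as is.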