Failed to test SOT model on a custom dataset
Thanks for your error report and we appreciate it a lot.
**Checklist**
- I have searched related issues but cannot get the expected help.
- The bug has not been fixed in the latest version.
**Describe the bug**
Testing STARK on a custom dataset (VideoCube) failed in a multi-GPU testing environment. When the testing was almost done, I got the following error:

```
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ] 1126168/1426764, 137.6 task/s, elapsed: 8183s, ETA: 2184s
[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808307 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808307 milliseconds before timing out.
```
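For context, the numbers in that log line up exactly with PyTorch's default NCCL collective timeout; a quick arithmetic check (plain Python, values copied from the log):

```python
# The watchdog's limit is the default 30-minute NCCL timeout, and this
# all-reduce ran only a few seconds past it before the job was killed.
default_timeout_ms = 1_800_000   # Timeout(ms)=1800000 from the log
elapsed_ms = 1_808_307           # "ran for 1808307 milliseconds"

print(default_timeout_ms / 60_000)                # 30.0 (minutes)
print((elapsed_ms - default_timeout_ms) / 1_000)  # 8.307 (seconds over the limit)
```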
NOTE: This error happens when the testing is almost done, and I can train (fine-tune) the model successfully via:

```
./tools/dist_train.sh configs/sot/stark/stark_st1_r50_videocube.py 8 --cfg-options model.method='restart'
```

The loss looks good, so I think my code is OK.
**Reproduction**
- What command or script did you run?

```
./tools/dist_test.sh configs/sot/stark/stark_st1_r50_videocube.py 8 --checkpoint work_dirs/stark_st1_r50_videocube/latest.pth --eval track
```
- Did you make any modifications to the code or config? Did you understand what you have modified?
  I added a new config (`stark_st1_r50_videocube.py`) under the original STARK configs folder and imported my custom dataset. However, I don't think the config is related to this error.
- What dataset did you use and what task did you run?
  VideoCube, but I think the dataset code is OK because I can train (fine-tune) the model and run inference; it only failed near the end of testing.
**Environment**

- Please run `python mmtrack/utils/collect_env.py` to collect necessary environment information and paste it here.

```
sys.platform: linux
Python: 3.9.11 (main, Mar 29 2022, 19:08:29) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: TITAN RTX
CUDA_HOME: /usr/local/cuda-10.0
NVCC: Cuda compilation tools, release 10.0, V10.0.130
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.12.0
OpenCV: 4.5.5
MMCV: 1.4.8
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMTracking: 0.13.0+059dc99
```
- You may add any additional information that may be helpful for locating the problem, such as:
  - How you installed PyTorch [e.g., pip, conda, source]
  - Other environment variables that may be related (such as `$PATH`, `$LD_LIBRARY_PATH`, `$PYTHONPATH`, etc.)
The testing is executed in a conda environment. I installed PyTorch via conda, and mmdet, mmtrack, and mmcv-full via pip.
**Error traceback**
If applicable, paste the error traceback here.

```
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ] 1126168/1426764, 137.6 task/s, elapsed: 8183s, ETA: 2184s[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808307 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808307 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 23037 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 23038 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 23039 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 23040 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 23041 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 23042 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 23035) of binary: /home/user/miniconda3/envs/vpa/bin/python
Traceback (most recent call last):
File "/home/user/miniconda3/envs/vpa/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/user/miniconda3/envs/vpa/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/user/miniconda3/envs/vpa/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/user/miniconda3/envs/vpa/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/user/miniconda3/envs/vpa/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/user/miniconda3/envs/vpa/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/user/miniconda3/envs/vpa/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/miniconda3/envs/vpa/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
./tools/test.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-05-11_17:49:23
host : localhost.localdomain
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 23035)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 23035
======================================================
```
**Bug fix**
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!
I found a related post on the PyTorch forum.
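Presumably the fix suggested there (and confirmed in the comments below) is raising the collective timeout at process-group initialization. A minimal sketch in plain PyTorch, assuming the `env://` rendezvous used by `torch.distributed.launch`; the three-hour value is an arbitrary example, not a recommendation:

```python
# Minimal sketch: raise the NCCL collective timeout above the
# 30-minute default (the Timeout(ms)=1800000 seen in the error log).
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend='nccl',
    init_method='env://',        # rendezvous used by torch.distributed.launch
    timeout=timedelta(hours=3),  # example value; must exceed the slowest rank's gap
)
```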
**Comment:**
I think the reason is that your test dataset is too large and it causes a collective operation timeout. Some debugging tips: https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out. I will also keep an eye on this issue.

**Reply:**
Thanks a lot, changing the default timeout solved my problem!
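For anyone hitting the same wall: in MMCV-based repos such as MMTracking, `tools/test.py` initializes distributed testing through `mmcv.runner.init_dist`, which forwards extra keyword arguments to `torch.distributed.init_process_group`. Assuming that pass-through behavior (true for mmcv 1.x, the version used here), "changing the default timeout" can look like this sketch:

```python
# Sketch of the timeout change at an MMCV-style entry point: init_dist
# forwards **kwargs to torch.distributed.init_process_group, so a longer
# timeout passed here overrides the 30-minute default.
from datetime import timedelta

from mmcv.runner import init_dist

# 'pytorch' is the launcher that ./tools/dist_test.sh selects.
init_dist('pytorch', backend='nccl', timeout=timedelta(hours=3))
```

Since the tools scripts typically call `init_dist(args.launcher, **cfg.dist_params)`, the same `timeout` key can equivalently be added to the config's `dist_params` dict.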