
Failed to test SOT model on Custom dataset.


Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug
Testing Stark on a custom dataset (VideoCube) failed in a multi-GPU testing environment. When the testing was almost done, I got the following error:

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>           ] 1126168/1426764, 137.6 task/s, elapsed: 8183s, ETA:  2184s
[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808307 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808307 milliseconds before timing out.

NOTE: This error happens when the testing is almost done, and I can train (fine-tune) the model successfully via:

./tools/dist_train.sh configs/sot/stark/stark_st1_r50_videocube.py 8 --cfg-options model.method='restart'

The loss looks good, so I think my code is OK.

Reproduction

  1. What command or script did you run?
 ./tools/dist_test.sh configs/sot/stark/stark_st1_r50_videocube.py 8 --checkpoint work_dirs/stark_st1_r50_videocube/latest.pth --eval track
  2. Did you make any modifications on the code or config? Did you understand what you have modified?

I added a new config (namely, stark_st1_r50_videocube.py) and imported my custom dataset under the original Stark configs folder (a hedged sketch of such a config is shown at the end of this section), but I don't think the config is related to this error.

  3. What dataset did you use and what task did you run?

VideoCube, but I think the dataset is OK because I can train (fine-tune) the model and run inference (it only fails near the end of testing).
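
For context, such a config typically just inherits from an existing Stark config and overrides the test dataset. The sketch below is hypothetical: the base config name, the VideoCubeDataset class, and the paths are assumptions for illustration, not taken from the issue.

# Hypothetical stark_st1_r50_videocube.py (illustrative only, not the author's config).
# Assumes an existing base Stark config and a custom dataset class registered in
# mmtrack as 'VideoCubeDataset'; adjust names and paths to the real setup.
_base_ = ['./stark_st1_r50_50e_lasot.py']  # assumed base config name

data_root = 'data/videocube/'  # assumed directory layout
data = dict(
    test=dict(
        type='VideoCubeDataset',
        ann_file=data_root + 'annotations/test_infos.txt',
        img_prefix=data_root + 'frames'))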

Environment

  1. Please run python mmtrack/utils/collect_env.py to collect necessary environment information and paste it here.
sys.platform: linux
Python: 3.9.11 (main, Mar 29 2022, 19:08:29) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: TITAN RTX
CUDA_HOME: /usr/local/cuda-10.0
NVCC: Cuda compilation tools, release 10.0, V10.0.130
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.12.0
OpenCV: 4.5.5
MMCV: 1.4.8
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMTracking: 0.13.0+059dc99
  2. You may add additional information that may be helpful for locating the problem, such as:
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

The testing is executed in a conda environment. I installed PyTorch via conda, and installed mmdet, mmtrack, and mmcv-full via pip.

Error traceback
If applicable, paste the error traceback here.

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>           ] 1126168/1426764, 137.6 task/s, elapsed: 8183s, ETA:  2184s[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808307 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808307 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 23037 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 23038 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 23039 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 23040 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 23041 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 23042 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 23035) of binary: /home/user/miniconda3/envs/vpa/bin/python
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/vpa/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/miniconda3/envs/vpa/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/miniconda3/envs/vpa/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/user/miniconda3/envs/vpa/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/user/miniconda3/envs/vpa/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/user/miniconda3/envs/vpa/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/user/miniconda3/envs/vpa/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/miniconda3/envs/vpa/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
./tools/test.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-05-11_17:49:23
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 23035)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 23035
======================================================

Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

I found a related post in the PyTorch forum.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 6

Top GitHub Comments

1 reaction
JingweiZhang12 commented, May 12, 2022

I think the reason is that your test dataset is too large, which causes a collective operation timeout. Some debugging tips: https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out. I will also keep an eye on this issue.
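
As a general aside (not from the original comment), those debugging tips largely amount to making NCCL report what it is doing and fail at the call site instead of only through the watchdog. A minimal sketch using standard PyTorch/NCCL environment variables; setting them inside tools/test.py rather than exporting them in the shell is just one option.

# Hypothetical debugging setup: these variables must be set before the NCCL
# process group is created, e.g. exported before ./tools/dist_test.sh or set
# near the top of tools/test.py.
import os

# Have NCCL print initialization and communication details to stderr.
os.environ.setdefault('NCCL_DEBUG', 'INFO')

# Block on each collective and raise at the call site when it times out, so the
# failing operation shows up in the Python traceback instead of only in the
# watchdog's abort message.
os.environ.setdefault('NCCL_BLOCKING_WAIT', '1')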

0 reactions
YiqunChen1999 commented, Jun 19, 2022

@YiqunChen1999 I met the same problem. I think the reason is that the NCCL default timeout is too short to collect results from different GPUs, so I tried changing the default timeout from 30 minutes to 24 hours. You can change the code in mmtracking/tools/test.py at line 142 accordingly (a screenshot of the change was attached to the original comment but is not reproduced here; see the sketch after this comment). After doing this, the problem was solved in my environment.

Thanks a lot, changing the default timeout solved my problem!
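
Since the screenshot is not part of this archive, here is a minimal sketch of what the described change presumably looks like, assuming mmcv's init_dist forwards extra keyword arguments to torch.distributed.init_process_group; the exact line number and surrounding code in tools/test.py may differ between MMTracking versions.

# Sketch only (not the code from the screenshot): extend the NCCL collective
# timeout from the 30-minute default to 24 hours when the distributed backend
# is initialized in tools/test.py. `args` and `cfg` are defined earlier in
# that script.
from datetime import timedelta

from mmcv.runner import init_dist

if args.launcher == 'none':
    distributed = False
else:
    distributed = True
    # Passed through to torch.distributed.init_process_group(timeout=...).
    init_dist(args.launcher, timeout=timedelta(hours=24), **cfg.dist_params)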
