RuntimeError: NCCL error
Description
I’m trying to train I3D on my own dataset. As soon as I run the training command, I get a RuntimeError: NCCL error.
Reproduction
- I’m using the following command to train the I3D model:
bash tools/dist_train.sh configs/recognition/i3d/my_config.py 8
- My config file is given below:
# model settings
model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='ResNet3d',
        pretrained2d=True,
        pretrained='torchvision://resnet50',
        depth=50,
        conv1_kernel=(5, 7, 7),
        conv1_stride_t=2,
        pool1_stride_t=2,
        conv_cfg=dict(type='Conv3d'),
        norm_eval=False,
        inflate=((1, 1, 1), (1, 0, 1, 0), (1, 0, 1, 0, 1, 0), (0, 1, 0)),
        zero_init_residual=False),
    cls_head=dict(
        type='I3DHead',
        num_classes=40,
        in_channels=2048,
        spatial_type='avg',
        dropout_ratio=0.5,
        init_std=0.01),
    # model training and testing settings
    train_cfg=None,
    test_cfg=dict(average_clips='prob'))
# dataset settings
dataset_type = 'VideoDataset'
data_root = 'data/videos_train'
data_root_val = 'data/videos_val'
ann_file_train = 'data/train_list.txt'
ann_file_val = 'data/val_list.txt'
ann_file_test = 'data/val_list.txt'
###################################
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)
train_pipeline = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=32, frame_interval=2, num_clips=1),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(
        type='MultiScaleCrop',
        input_size=224,
        scales=(1, 0.8),
        random_crop=False,
        max_wh_scale_gap=0),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
val_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=2,
        num_clips=1,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=2,
        num_clips=10,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='ThreeCrop', crop_size=256),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
data = dict(
    videos_per_gpu=8,
    workers_per_gpu=2,
    test_dataloader=dict(videos_per_gpu=1),
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=data_root,
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=data_root_val,
        pipeline=val_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=data_root_val,
        pipeline=test_pipeline))
# runtime settings
work_dir = './work_dirs/my_i3d/'
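With num_classes=40 in the head, the labels in the annotation files need to fall in the range 0 to 39. A minimal sketch of a sanity check follows. It assumes each non-empty line of the annotation file ends with an integer label separated by whitespace (for example "some_video.mp4 3"); the helper name find_out_of_range_labels is hypothetical and not part of MMAction2.

```python
from pathlib import Path


def find_out_of_range_labels(ann_file, num_classes):
    """Return (line_number, label) pairs whose label is outside [0, num_classes - 1]."""
    bad = []
    lines = Path(ann_file).read_text().splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        # Assumption: the label is the last whitespace-separated field on the line.
        label = int(line.rsplit(maxsplit=1)[-1])
        if not 0 <= label < num_classes:
            bad.append((line_no, label))
    return bad
```

Calling find_out_of_range_labels('data/train_list.txt', 40) would then list any offending lines; an empty list suggests the labels are at least within range.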
Environment
- I’m working on Google Colab.
- The necessary environment information is given below:
sys.platform: linux
Python: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
CUDA available: True
GPU 0: Tesla T4
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.0
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2022.0-Product Build 20211112 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.3.2 (built against CUDA 11.5)
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.13.0
OpenCV: 4.6.0
MMCV: 1.6.0
MMCV Compiler: n/a
MMCV CUDA Compiler: n/a
MMAction2: 0.24.0+12f16c1
- I used the following command to install PyTorch and CUDA:
conda install pytorch torchvision cudatoolkit=11.3 -c pytorch
Error traceback
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
work = default_pg.broadcast([tensor], opts)work = default_pg.broadcast([tensor], opts)
RuntimeError: RuntimeErrorNCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).:
NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
work = default_pg.broadcast([tensor], opts)
RuntimeErrorwork = default_pg.broadcast([tensor], opts): NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3176) of binary: /usr/local/bin/python
Traceback (most recent call last):
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/local/lib/python3.7/site-packages/torch/distributed/run.py", line 755, in run
)(*cmd_args)
File "/usr/local/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/train.py FAILED
------------------------------------------------------------

@hmhamza The labels should begin at 0 and end at num_classes - 1.

@hmhamza How many GPUs does the Google Colab env have? In the command you are using to run the model, the last number, 8, means using 8 GPUs.
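
One way to avoid requesting more processes than there are GPUs is to ask PyTorch how many devices it can see before launching. The sketch below is only an illustration: count_visible_gpus and pick_num_processes are hypothetical helper names, and the only library call used is torch.cuda.device_count().

```python
def count_visible_gpus():
    """Return the number of CUDA devices PyTorch can see (0 if PyTorch is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def pick_num_processes(requested, available):
    """Clamp the requested process count to the number of GPUs that exist."""
    if available < 1:
        raise RuntimeError("no CUDA device is visible")
    return min(requested, available)
```

If the Colab instance really has a single GPU, as the "GPU 0: Tesla T4" line in the environment report suggests, pick_num_processes(8, count_visible_gpus()) would give 1, so the last argument of the launch command would be 1 instead of 8.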