nccl.h is not found or ncclUnhandledCudaError: Call to CUDA function failed
Describe the bug
The `nccl.h` header is not found during compilation, or the tests fail at runtime with ncclUnhandledCudaError: Call to CUDA function failed.
To Reproduce
Steps to reproduce the behavior:
- USE_NCCL=1 python setup.py install
Logs
running install
running bdist_egg
running egg_info
writing fastmoe.egg-info/PKG-INFO
writing dependency_links to fastmoe.egg-info/dependency_links.txt
writing top-level names to fastmoe.egg-info/top_level.txt
reading manifest file 'fastmoe.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'fastmoe.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'fmoe_cuda' extension
Emitting ninja build file /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/7] c++ -MMD -MF /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /home/xinglinpan/miniconda3/envs/fmoe/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o
c++ -MMD -MF /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /home/xinglinpan/miniconda3/envs/fmoe/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/xinglinpan/fastmoe-master/cuda/global_exchange.h:1:0,
from /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp:1:
/home/xinglinpan/fastmoe-master/cuda/stream_manager.h:7:18: fatal error: nccl.h: No such file or directory
compilation terminated.
[2/7] /usr/local/cuda/bin/nvcc -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/balancing.cu -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70
FAILED: /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o
/usr/local/cuda/bin/nvcc -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/balancing.cu -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70
In file included from /home/xinglinpan/fastmoe-master/cuda/balancing.cuh:1:0,
from /home/xinglinpan/fastmoe-master/cuda/balancing.cu:2:
/home/xinglinpan/fastmoe-master/cuda/stream_manager.h:7:18: fatal error: nccl.h: No such file or directory
compilation terminated.
Attempted fix
- Download nccl_2.7.8-1+cuda10.2_x86_64
- Set the environment variables as mentioned
- USE_NCCL=1 python setup.py install
Installed /home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg
Processing dependencies for fastmoe==1.0.0
Finished processing dependencies for fastmoe==1.0.0
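The "set environment variables" step above typically means pointing the compiler, linker, and runtime loader at the downloaded NCCL. A minimal sketch — the extraction path `$HOME/nccl_2.7.8-1+cuda10.2_x86_64` is an assumption, adjust it to wherever the tarball was unpacked:

```shell
# Assumption: NCCL tarball extracted to $HOME/nccl_2.7.8-1+cuda10.2_x86_64
NCCL_HOME="$HOME/nccl_2.7.8-1+cuda10.2_x86_64"
export CPATH="$NCCL_HOME/include${CPATH:+:$CPATH}"                              # compiler finds nccl.h
export LIBRARY_PATH="$NCCL_HOME/lib${LIBRARY_PATH:+:$LIBRARY_PATH}"             # linker finds libnccl
export LD_LIBRARY_PATH="$NCCL_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"    # runtime lookup
# then rebuild:
# USE_NCCL=1 python setup.py install
```

Note that `LD_LIBRARY_PATH` must also be set in the shell that later runs the tests, not only at build time.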
- cd tests && pytest test_ddp.py
Traceback (most recent call last):
File "/home/xinglinpan/fastmoe-master/tests/test_ddp.py", line 139, in <module>
locals()[sys.argv[1]](**args)
File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
(An identical traceback is printed by the second worker process.)
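When this error appears, NCCL's own logging usually names the failing CUDA call and the rank it failed on. A general NCCL debugging technique (not something the original reporter ran) is to rerun the test with NCCL's debug variables set:

```shell
# NCCL_DEBUG=INFO makes NCCL print which CUDA call failed and on which rank.
export NCCL_DEBUG=INFO
# Optional: restrict the verbose output to initialization and collectives.
export NCCL_DEBUG_SUBSYS=INIT,COLL
# then rerun the failing test:
# pytest test_ddp.py
```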
Platform
- Device: GeForce RTX 2080Ti
- OS: Linux gpu9 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- CUDA version: 10.2
- NCCL version: 2.7.8-1
- PyTorch version: 1.9.1
- Python Version: 3.8
Additional context
>>> torch.cuda.nccl.version()
2708
Could some necessary environment variables be getting lost when the worker processes are spawned via subprocess.Popen?
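That hypothesis is easy to test directly. By default `subprocess.Popen`/`subprocess.run` let the child inherit the parent's full environment; variables such as `LD_LIBRARY_PATH` are lost only when an explicit `env=` mapping omits them. A self-contained sketch (the `/opt/nccl/lib` value is a placeholder, not from the report):

```python
import os
import subprocess
import sys

# Placeholder value, standing in for a real NCCL library path.
os.environ["LD_LIBRARY_PATH"] = "/opt/nccl/lib"

# A tiny child process that reports what it sees.
probe = [sys.executable, "-c",
         "import os; print(os.environ.get('LD_LIBRARY_PATH', 'MISSING'))"]

# Default behavior: the child inherits the parent's entire environment.
inherited = subprocess.run(probe, capture_output=True, text=True).stdout.strip()

# Passing env= replaces the environment wholesale; anything omitted from
# the mapping (LD_LIBRARY_PATH here) does not reach the child.
replaced = subprocess.run(probe, env={"PATH": os.environ["PATH"]},
                          capture_output=True, text=True).stdout.strip()

print(inherited)  # /opt/nccl/lib
print(replaced)   # MISSING
```

So if the test launcher passes a custom `env=` to Popen, that would indeed explain a variable going missing; with the default of `env=None`, it would not.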
Issue Analytics
- Created: a year ago
- Comments: 9 (9 by maintainers)
Top GitHub Comments
I am finally beginning to understand the issue.
We updated the distributed parameter initialization with a bcast at https://github.com/laekov/fastmoe/blob/master/fmoe/distributed.py#L100, which is not correct. In PyTorch's distributed module, you are supposed to pass a global rank to the broadcast function, and it translates the global rank to a local rank itself. (A very confusing design, in my view.) So, when we have multiple data-parallel groups, rank 0 is not a member of many of the other communicators, which raises this issue. I will have that fixed later today.
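The rank-translation pitfall described in the comment can be sketched without torch: a process group is defined by a list of global ranks, and the broadcast source must be given as a global rank, which the library then translates to a position within the group — failing if the source is not a member. The function name below is hypothetical, mimicking what `torch.distributed` does internally:

```python
def group_local_rank(group_global_ranks, src_global_rank):
    """Translate a global rank to its index within a process group,
    as torch.distributed does internally for broadcast(src=...)."""
    if src_global_rank not in group_global_ranks:
        raise ValueError(
            f"rank {src_global_rank} is not a member of this group")
    return group_global_ranks.index(src_global_rank)

# Two data-parallel groups over four processes: [0, 1] and [2, 3].
groups = [[0, 1], [2, 3]]

# Broadcasting from global rank 0 only works in the group containing it;
# in the second group it fails -- the bug described above.
print(group_local_rank(groups[0], 0))  # 0
try:
    group_local_rank(groups[1], 0)
except ValueError as e:
    print(e)  # rank 0 is not a member of this group
```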
We also tried using Docker to work around this problem. Our command:
log