
nccl.h is not found or ncclUnhandledCudaError: Call to CUDA function failed


Describe the bug

The ‘nccl.h’ file is not found, or ncclUnhandledCudaError: Call to CUDA function failed.

To Reproduce

Steps to reproduce the behavior:

  1. USE_NCCL=1 python setup.py install

Logs

running install
running bdist_egg
running egg_info
writing fastmoe.egg-info/PKG-INFO
writing dependency_links to fastmoe.egg-info/dependency_links.txt
writing top-level names to fastmoe.egg-info/top_level.txt
reading manifest file 'fastmoe.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'fastmoe.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'fmoe_cuda' extension
Emitting ninja build file /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/7] c++ -MMD -MF /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /home/xinglinpan/miniconda3/envs/fmoe/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o 
c++ -MMD -MF /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /home/xinglinpan/miniconda3/envs/fmoe/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/xinglinpan/fastmoe-master/cuda/global_exchange.h:1:0,
                 from /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp:1:
/home/xinglinpan/fastmoe-master/cuda/stream_manager.h:7:18: fatal error: nccl.h: No such file or directory
compilation terminated.
[2/7] /usr/local/cuda/bin/nvcc  -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/balancing.cu -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70
FAILED: /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o 
/usr/local/cuda/bin/nvcc  -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/balancing.cu -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70
In file included from /home/xinglinpan/fastmoe-master/cuda/balancing.cuh:1:0,
                 from /home/xinglinpan/fastmoe-master/cuda/balancing.cu:2:
/home/xinglinpan/fastmoe-master/cuda/stream_manager.h:7:18: fatal error: nccl.h: No such file or directory
compilation terminated.

Try to fix

  1. Download nccl_2.7.8-1+cuda10.2_x86_64
  2. Set environment variables as mentioned (see the sketch after the traceback below)
  3. USE_NCCL=1 python setup.py install

Installed /home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg
Processing dependencies for fastmoe==1.0.0
Finished processing dependencies for fastmoe==1.0.0

  4. cd test && pytest test_ddp.py
Traceback (most recent call last):
  File "/home/xinglinpan/fastmoe-master/tests/test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
    torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
Traceback (most recent call last):
  File "/home/xinglinpan/fastmoe-master/tests/test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
    torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
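For step 2 above, here is a minimal sketch of the environment setup, assuming the NCCL archive was unpacked under the home directory; the path and the exact variable names are assumptions for illustration, not necessarily what was actually used:

# Hedged helper: check that nccl.h is reachable in the downloaded NCCL archive
# and print the exports the fastmoe build typically needs. Adjust nccl_home to
# wherever the tarball was actually unpacked.
import os

nccl_home = os.path.expanduser("~/nccl_2.7.8-1+cuda10.2_x86_64")
header = os.path.join(nccl_home, "include", "nccl.h")
if not os.path.isfile(header):
    raise SystemExit(f"nccl.h not found under {nccl_home}/include")

# Run these in the shell before `USE_NCCL=1 python setup.py install`.
print(f"export CPATH={nccl_home}/include:$CPATH")
print(f"export LIBRARY_PATH={nccl_home}/lib:$LIBRARY_PATH")
print(f"export LD_LIBRARY_PATH={nccl_home}/lib:$LD_LIBRARY_PATH")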

Platform

  • Device: GeForce RTX 2080Ti
  • OS: Linux gpu9 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • CUDA version: 10.2
  • NCCL version: 2.7.8-1
  • PyTorch version: 1.9.1
  • Python version: 3.8

Additional context

>>> torch.cuda.nccl.version()
2708

Could some necessary environment variables be lost when the worker processes are spawned with subprocess.Popen?

https://github.com/laekov/fastmoe/blob/670e1407eb1f674a47c45c78567d9217e062caab/tests/test_ddp.py#L44
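By default subprocess.Popen inherits the parent's environment; variables are only dropped if an explicit env= dictionary is passed. A minimal sketch of forwarding the environment explicitly when launching a worker (the script name and arguments below are illustrative, not taken from test_ddp.py):

# Hypothetical launcher: copy the parent environment so NCCL/CUDA settings
# survive in the spawned worker; env=None would also inherit them by default.
import os
import subprocess
import sys

env = os.environ.copy()
env.setdefault("NCCL_DEBUG", "INFO")   # example: turn on NCCL logging

proc = subprocess.Popen(
    [sys.executable, "test_ddp.py", "test_fmoe_linear"],
    env=env,
)
proc.wait()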

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
laekov commented, Jun 10, 2022

I'm finally beginning to understand the issue.

We updated the distributed parameter initialization to use a broadcast at https://github.com/laekov/fastmoe/blob/master/fmoe/distributed.py#L100, which is not correct. In PyTorch's distributed module, you are supposed to pass a global rank to the broadcast function, and it maps the global rank to the group-local rank itself (a very strange design, from my point of view). So when there are multiple data-parallel groups, global rank 0 is not a member of many of the other communicators, which raises this issue.

I will have that fixed later today.
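A rough sketch of the fix described above, assuming the list of global ranks used to build each communicator is available; the function and variable names are hypothetical, not the actual fmoe/distributed.py code:

# torch.distributed.broadcast expects a *global* rank for `src` and converts it
# to the group-local rank internally, so broadcast from the group's lowest
# global rank instead of hard-coding 0.
import torch.distributed as dist

def sync_params(coalesced, comm, group_global_ranks):
    # group_global_ranks: the ranks passed to dist.new_group when comm was created
    src = min(group_global_ranks)
    dist.broadcast(coalesced, src, group=comm)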

0 reactions
Fragile-azalea commented, Jun 10, 2022

We also tried using Docker to work around the problem. Our commands:

  1. docker pull pytorch/pytorch:1.9.0-cuda10.2-cudnn7-devel
  2. docker run -it --gpus all -v /home/xinglinpan/fastmoe-master:/fastmoe --ipc=host pytorch/pytorch:1.9.0-cuda10.2-cudnn7-devel /bin/bash
  3. pip install dm-tree ninja pytest
  4. rm -rf /fastmoe/build
  5. rm -rf /fastmoe/fastmoe.egg-info
  6. USE_NCCL=1 python setup.py install
  7. python demo.py
  8. python test_ddp.py

Logs

// demo.py
tensor([ 9, 10, 11, 12], device='cuda:2')
tensor([5, 6, 7, 8], device='cuda:1')
tensor([1, 2, 3, 4], device='cuda:0')
tensor([13, 14, 15, 16], device='cuda:3')
[tensor([1, 2, 3, 4], device='cuda:3'), tensor([5, 6, 7, 8], device='cuda:3'), tensor([ 9, 10, 11, 12], device='cuda:3'), tensor([13, 14, 15, 16], device='cuda:3')]
[tensor([1, 2, 3, 4], device='cuda:1'), tensor([5, 6, 7, 8], device='cuda:1'), tensor([ 9, 10, 11, 12], device='cuda:1'), tensor([13, 14, 15, 16], device='cuda:1')]
[tensor([1, 2, 3, 4], device='cuda:0'), tensor([5, 6, 7, 8], device='cuda:0'), tensor([ 9, 10, 11, 12], device='cuda:0'), tensor([13, 14, 15, 16], device='cuda:0')]
[tensor([1, 2, 3, 4], device='cuda:2'), tensor([5, 6, 7, 8], device='cuda:2'), tensor([ 9, 10, 11, 12], device='cuda:2'), tensor([13, 14, 15, 16], device='cuda:2')]
// test_ddp.py
4
44a3b6d368a5:100:100 [0] NCCL INFO Bootstrap : Using [0]eth0:172.17.0.3<0>
44a3b6d368a5:100:100 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

44a3b6d368a5:100:100 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
44a3b6d368a5:100:100 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
44a3b6d368a5:100:100 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
Traceback (most recent call last):
  File "test_ddp.py", line 140, in <module>
    locals()[sys.argv[1]](**args)
  File "/fastmoe/tests/test_numerical.py", line 346, in _test_fmoe_local_ddp
    mp_group=mp_group, dp_group=dp_group, world_group=world_group)
  File "/opt/conda/lib/python3.7/site-packages/fastmoe-1.0.0-py3.7-linux-x86_64.egg/fmoe/distributed.py", line 80, in __init__
    self._sync_params()
  File "/opt/conda/lib/python3.7/site-packages/fastmoe-1.0.0-py3.7-linux-x86_64.egg/fmoe/distributed.py", line 100, in _sync_params
    torch.distributed.broadcast(coalesced, 0, group=comm)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1078, in broadcast
    group_src_rank = _get_group_rank(group, src)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 250, in _get_group_rank
    raise RuntimeError(f"The global rank {rank} is not part of the group {group}") from None
RuntimeError: The global rank 0 is not part of the group <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7f2b887d5e70>
44a3b6d368a5:102:102 [0] NCCL INFO Bootstrap : Using [0]eth0:172.17.0.3<0>
44a3b6d368a5:102:102 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

44a3b6d368a5:102:102 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
44a3b6d368a5:102:102 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
44a3b6d368a5:102:102 [0] NCCL INFO Using network Socket
44a3b6d368a5:100:146 [0] NCCL INFO Channel 00/02 :    0   1
44a3b6d368a5:102:147 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
44a3b6d368a5:100:146 [0] NCCL INFO Channel 01/02 :    0   1
44a3b6d368a5:102:147 [0] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
44a3b6d368a5:102:147 [0] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
44a3b6d368a5:100:146 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
44a3b6d368a5:100:146 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
44a3b6d368a5:100:146 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
44a3b6d368a5:102:147 [0] NCCL INFO Channel 00 : 1[b1000] -> 0[3d000] via direct shared memory
44a3b6d368a5:100:146 [0] NCCL INFO Channel 00 : 0[3d000] -> 1[b1000] via direct shared memory
44a3b6d368a5:100:146 [0] NCCL INFO Channel 01 : 0[3d000] -> 1[b1000] via direct shared memory
44a3b6d368a5:102:147 [0] NCCL INFO Channel 01 : 1[b1000] -> 0[3d000] via direct shared memory
44a3b6d368a5:100:146 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
44a3b6d368a5:100:146 [0] NCCL INFO comm 0x7fbb78001060 rank 0 nranks 2 cudaDev 0 busId 3d000 - Init COMPLETE
44a3b6d368a5:102:147 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
44a3b6d368a5:100:100 [0] NCCL INFO Launch mode Parallel
44a3b6d368a5:102:147 [0] NCCL INFO comm 0x7f6c68001060 rank 1 nranks 2 cudaDev 0 busId b1000 - Init COMPLETE
Traceback (most recent call last):
  File "test_ddp.py", line 140, in <module>
    locals()[sys.argv[1]](**args)
  File "/fastmoe/tests/test_numerical.py", line 346, in _test_fmoe_local_ddp
    mp_group=mp_group, dp_group=dp_group, world_group=world_group)
  File "/opt/conda/lib/python3.7/site-packages/fastmoe-1.0.0-py3.7-linux-x86_64.egg/fmoe/distributed.py", line 80, in __init__
    self._sync_params()
  File "/opt/conda/lib/python3.7/site-packages/fastmoe-1.0.0-py3.7-linux-x86_64.egg/fmoe/distributed.py", line 100, in _sync_params
    torch.distributed.broadcast(coalesced, 0, group=comm)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1078, in broadcast
    group_src_rank = _get_group_rank(group, src)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 250, in _get_group_rank
    raise RuntimeError(f"The global rank {rank} is not part of the group {group}") from None
RuntimeError: The global rank 0 is not part of the group <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7ff9018b9f70>
NCCL version 2.7.8+cuda10.2
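The RuntimeError above can be reproduced in isolation: broadcast interprets src as a global rank, so passing 0 for a group that does not contain global rank 0 raises exactly this error. A minimal, hypothetical repro (gloo backend, CPU only; nothing here is taken from the fastmoe tests):

# Spawn 4 processes, build a group that excludes global rank 0, then broadcast
# with src=0 inside that group; the member ranks raise
# "The global rank 0 is not part of the group".
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    group = dist.new_group(ranks=[1, 2, 3])   # global rank 0 is not a member
    t = torch.zeros(1)
    if rank != 0:
        dist.broadcast(t, 0, group=group)     # RuntimeError on ranks 1-3
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)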

