nccl.h is not found or ncclUnhandledCudaError: Call to CUDA function failed
Describe the bug
The `nccl.h` header is not found during compilation, or the tests fail at runtime with ncclUnhandledCudaError: Call to CUDA function failed.
To Reproduce
Steps to reproduce the behavior:
- USE_NCCL=1 python setup.py install
Logs
running install
running bdist_egg
running egg_info
writing fastmoe.egg-info/PKG-INFO
writing dependency_links to fastmoe.egg-info/dependency_links.txt
writing top-level names to fastmoe.egg-info/top_level.txt
reading manifest file 'fastmoe.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'fastmoe.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'fmoe_cuda' extension
Emitting ninja build file /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/7] c++ -MMD -MF /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /home/xinglinpan/miniconda3/envs/fmoe/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o
c++ -MMD -MF /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /home/xinglinpan/miniconda3/envs/fmoe/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/xinglinpan/fastmoe-master/cuda/global_exchange.h:1:0,
from /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp:1:
/home/xinglinpan/fastmoe-master/cuda/stream_manager.h:7:18: fatal error: nccl.h: No such file or directory
compilation terminated.
[2/7] /usr/local/cuda/bin/nvcc -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/balancing.cu -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70
FAILED: /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o
/usr/local/cuda/bin/nvcc -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/balancing.cu -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70
In file included from /home/xinglinpan/fastmoe-master/cuda/balancing.cuh:1:0,
from /home/xinglinpan/fastmoe-master/cuda/balancing.cu:2:
/home/xinglinpan/fastmoe-master/cuda/stream_manager.h:7:18: fatal error: nccl.h: No such file or directory
compilation terminated.
Attempted fix
- Download nccl_2.7.8-1+cuda10.2_x86_64
- Set the environment variables as mentioned
- USE_NCCL=1 python setup.py install
Installed /home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg
Processing dependencies for fastmoe==1.0.0
Finished processing dependencies for fastmoe==1.0.0
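The "set environment variables" step above typically means pointing the compiler, linker, and runtime loader at the downloaded NCCL. A minimal sketch — the extraction path `$HOME/nccl_2.7.8-1+cuda10.2_x86_64` is an assumption, adjust it to wherever the tarball was unpacked:

```shell
# Assumption: NCCL tarball extracted to $HOME/nccl_2.7.8-1+cuda10.2_x86_64
NCCL_HOME="$HOME/nccl_2.7.8-1+cuda10.2_x86_64"
export CPATH="$NCCL_HOME/include${CPATH:+:$CPATH}"                              # compiler finds nccl.h
export LIBRARY_PATH="$NCCL_HOME/lib${LIBRARY_PATH:+:$LIBRARY_PATH}"             # linker finds libnccl
export LD_LIBRARY_PATH="$NCCL_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"    # runtime lookup
# then rebuild:
# USE_NCCL=1 python setup.py install
```

Note that `LD_LIBRARY_PATH` must also be set in the shell that later runs the tests, not only at build time.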
- cd tests && pytest test_ddp.py
Traceback (most recent call last):
File "/home/xinglinpan/fastmoe-master/tests/test_ddp.py", line 139, in <module>
locals()[sys.argv[1]](**args)
File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
(An identical traceback is printed by the second worker process.)
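When this error appears, NCCL's own logging usually names the failing CUDA call and the rank it failed on. A general NCCL debugging technique (not something the original reporter ran) is to rerun the test with NCCL's debug variables set:

```shell
# NCCL_DEBUG=INFO makes NCCL print which CUDA call failed and on which rank.
export NCCL_DEBUG=INFO
# Optional: restrict the verbose output to initialization and collectives.
export NCCL_DEBUG_SUBSYS=INIT,COLL
# then rerun the failing test:
# pytest test_ddp.py
```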
Platform
- Device: GeForce RTX 2080Ti
- OS: Linux gpu9 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- CUDA version: 10.2
- NCCL version: 2.7.8-1
- PyTorch version: 1.9.1
- Python Version: 3.8
Additional context
>>> torch.cuda.nccl.version()
2708
Could some necessary environment variables be getting lost when the worker processes are spawned via subprocess.Popen?
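That hypothesis is easy to test directly. By default `subprocess.Popen`/`subprocess.run` let the child inherit the parent's full environment; variables such as `LD_LIBRARY_PATH` are lost only when an explicit `env=` mapping omits them. A self-contained sketch (the `/opt/nccl/lib` value is a placeholder, not from the report):

```python
import os
import subprocess
import sys

# Placeholder value, standing in for a real NCCL library path.
os.environ["LD_LIBRARY_PATH"] = "/opt/nccl/lib"

# A tiny child process that reports what it sees.
probe = [sys.executable, "-c",
         "import os; print(os.environ.get('LD_LIBRARY_PATH', 'MISSING'))"]

# Default behavior: the child inherits the parent's entire environment.
inherited = subprocess.run(probe, capture_output=True, text=True).stdout.strip()

# Passing env= replaces the environment wholesale; anything omitted from
# the mapping (LD_LIBRARY_PATH here) does not reach the child.
replaced = subprocess.run(probe, env={"PATH": os.environ["PATH"]},
                          capture_output=True, text=True).stdout.strip()

print(inherited)  # /opt/nccl/lib
print(replaced)   # MISSING
```

So if the test launcher passes a custom `env=` to Popen, that would indeed explain a variable going missing; with the default of `env=None`, it would not.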
Issue Analytics
- Created: a year ago
- Comments: 9 (9 by maintainers)
Top GitHub Comments
I am finally beginning to understand the issue.
We updated the distributed parameter initialization with a bcast at https://github.com/laekov/fastmoe/blob/master/fmoe/distributed.py#L100, which is not correct. In PyTorch's distributed module, you are supposed to pass a global rank to the broadcast function, and it translates the global rank to a local rank itself. (A very confusing design, in my view.) So, when we have multiple data-parallel groups, rank 0 is not a member of many of the other communicators, which raises this issue. I will have that fixed later today.
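The rank-translation pitfall described in the comment can be sketched without torch: a process group is defined by a list of global ranks, and the broadcast source must be given as a global rank, which the library then translates to a position within the group — failing if the source is not a member. The function name below is hypothetical, mimicking what `torch.distributed` does internally:

```python
def group_local_rank(group_global_ranks, src_global_rank):
    """Translate a global rank to its index within a process group,
    as torch.distributed does internally for broadcast(src=...)."""
    if src_global_rank not in group_global_ranks:
        raise ValueError(
            f"rank {src_global_rank} is not a member of this group")
    return group_global_ranks.index(src_global_rank)

# Two data-parallel groups over four processes: [0, 1] and [2, 3].
groups = [[0, 1], [2, 3]]

# Broadcasting from global rank 0 only works in the group containing it;
# in the second group it fails -- the bug described above.
print(group_local_rank(groups[0], 0))  # 0
try:
    group_local_rank(groups[1], 0)
except ValueError as e:
    print(e)  # rank 0 is not a member of this group
```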
We also tried using Docker to work around this problem. Our command:
log