NameError: name '_CAPI_DGLNCCLGetUniqueId' is not defined
🐛 Bug
Traceback (most recent call last):
  File "entity_sample.py", line 196, in <module>
    main(args)
  File "entity_sample.py", line 153, in main
    labels, emb_optimizer, optimizer)
  File "entity_sample.py", line 98, in train
    emb_optimizer.step()
  File "/home/ubuntu/anaconda3/envs/RGCN/lib/python3.7/site-packages/dgl-0.8-py3.7-linux-x86_64.egg/dgl/optim/pytorch/sparse_optim.py", line 80, in step
    self._comm_setup()
  File "/home/ubuntu/anaconda3/envs/RGCN/lib/python3.7/site-packages/dgl-0.8-py3.7-linux-x86_64.egg/dgl/optim/pytorch/sparse_optim.py", line 109, in _comm_setup
    self._comm = nccl.Communicator(1, 0, nccl.UniqueId())
  File "/home/ubuntu/anaconda3/envs/RGCN/lib/python3.7/site-packages/dgl-0.8-py3.7-linux-x86_64.egg/dgl/cuda/nccl.py", line 22, in __init__
    self._handle = _CAPI_DGLNCCLGetUniqueId()
NameError: name '_CAPI_DGLNCCLGetUniqueId' is not defined
To Reproduce
Steps to reproduce the behavior:
- git clone https://github.com/mufeili/dgl.git -b simplify
- Install DGL from source
- cd dgl/examples/pytorch/rgcn
- python entity_sample.py -d am --n-bases 40 --gpu 0 --fanout '35,35' --batch-size 64 --n-hidden 16 --use-self-loop --n-epochs=20 --dgl-sparse --sparse-lr 0.02 --dropout 0.7
Expected behavior
Runs without error.
Environment
- DGL Version (e.g., 1.0):
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.10.0
- OS (e.g., Linux): Linux
- How you installed DGL (conda, pip, source): source
- Build command you used (if compiling from source):
- Python version: 3.8
- CUDA/cuDNN version (if applicable): 10.2
- GPU models and configuration (e.g. V100):
- Any other relevant information:
Issue Analytics
- State:
- Created 2 years ago
- Comments: 5 (3 by maintainers)
Thanks. I think there are really two issues here:

1. The sparse optimizer sets up an NCCL communicator even for a single worker, where no communication is actually needed.
2. DGL should check whether it was built with NCCL support, i.e., with -DUSE_NCCL=ON, before attempting to call NCCL-specific operations.

To fix 1 we could either a) add a third code path for the case where we want to store things on the GPU but do not need NCCL, or b) add support for things like _CAPI_DGLNCCLGetUniqueId() and communicator creation when NCCL is not enabled, but restrict them to communicators of size 1, where no communication actually needs to take place.

Generally, I think it's better to fall back to a solution without NCCL if DGL is not built with NCCL.