[NCCL Error] Enable distributed expert feature
Hi,
I installed fastmoe using `USE_NCCL=1 python setup.py install`.
When I set `expert_dp_comm` to `"dp"`, the training process is fine. But when I set `expert_dp_comm` to `"none"` (i.e., each worker serves several unique expert networks), the process hits an NCCL error:
NCCL Error at /home/h/code_gpt/fmoe-package/cuda/moe_comm_kernel.cu:29 value 4
I'm looking forward to your help!
My environment: PyTorch 1.8, NCCL 2.8.3, CUDA 10.1.
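For context, here is a minimal sketch of the setup described above. It is not the reporter's script: it assumes fastmoe exposes `FMoETransformerMLP` and `DistributedGroupedDataParallel` with roughly the arguments shown (as in its README), and the exact names and keywords may differ between versions; the `"dp"` versus `"none"` behaviour is summarized only in comments.

```python
# Minimal sketch, not the exact training script from the issue.
# Assumptions: fastmoe provides FMoETransformerMLP and
# DistributedGroupedDataParallel with roughly these arguments, and expert
# parameters carry a dp_comm tag ("dp" = replicated and all-reduced,
# "none" = unique per worker). Names may differ across fastmoe versions.
import torch
import torch.distributed as dist
from fmoe import FMoETransformerMLP
from fmoe.distributed import DistributedGroupedDataParallel as fmoeDDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

layer = FMoETransformerMLP(
    num_expert=4,                      # experts held by THIS worker
    d_model=768,
    d_hidden=1536,
    world_size=dist.get_world_size(),  # >1 enables NCCL all-to-all token exchange
    top_k=2,
).cuda()

# With expert_dp_comm="dp" the expert weights are all-reduced like ordinary
# data-parallel parameters; with "none" each worker keeps its own experts
# and only the tokens travel over NCCL.
model = fmoeDDP(layer)
```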
Issue Analytics
- State:
- Created 2 years ago
- Comments: 6 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
The distributed experts feature is enabled by default in `fmoefy`. You may double-check the place where you call the function. In our experiment, we use NVIDIA V100 32GB GPUs, and 12 experts are placed on each GPU. In other words, our `--num-expert` is set to 12.
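A hedged sketch of the configuration the maintainer describes, assuming `fmoefy` from `fmoe.megatron` takes the per-worker expert count via a `num_experts` argument (as in the fastmoe README; the keyword may differ by version), and using `megatron_model` as a stand-in for a model built by Megatron-LM:

```python
# Sketch only: `megatron_model` stands in for a Megatron-LM model that is
# not constructed here, and the num_experts keyword is an assumption based
# on the fastmoe README.
from fmoe.megatron import fmoefy

num_expert_per_worker = 12           # the --num-expert value mentioned above
world_size = 8                       # e.g. one node with 8 GPUs
total_experts = num_expert_per_worker * world_size   # 12 * 8 = 96 experts overall

model = fmoefy(megatron_model, num_experts=num_expert_per_worker)
```

Since the count is per worker, the global number of experts grows with the number of workers, which is how a 96-expert run can fit when each GPU only has to hold 12 experts.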
Thanks for the docker image.
I installed fastmoe using USE_NCCL=1, but when I run GPT-2 (L12-H768, intermediate size 1536, top-2) on an 8-GPU machine, the largest expert number I can reach is 32, while the FastMoE paper reports 96 experts. When I increase the expert number to 48 (batch size per GPU: 1), CUDA OOM occurs.
It seems that the distributed expert feature was not activated. Do you have any suggestions?
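One way to sanity-check whether the experts really stay local to each worker is to look at how fastmoe tags its parameters. The sketch below is a rough diagnostic, not an official API: it assumes expert parameters carry a `dp_comm` attribute (the tag `DistributedGroupedDataParallel` uses to decide what to all-reduce) and that `"none"` marks parameters kept unique per worker; the attribute name and values may differ in your fastmoe version, and `model` refers to the wrapped module from the earlier sketches.

```python
# Rough diagnostic sketch. Assumption: fastmoe tags parameters with a
# `dp_comm` attribute; "none" means never all-reduced, i.e. each worker
# keeps its own copy (unique experts), while "dp" means ordinary
# data-parallel replication.
def summarize_dp_comm(module):
    """Count parameter elements per dp_comm group."""
    counts = {}
    for _, p in module.named_parameters():
        comm = getattr(p, "dp_comm", "dp")   # untagged params treated as plain DP
        counts[comm] = counts.get(comm, 0) + p.numel()
    return counts

print(summarize_dp_comm(model))
# A large "none" bucket suggests the distributed expert feature is active;
# if everything falls under "dp", the experts are being replicated instead,
# which would explain running out of memory well before 96 experts.
```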