
[NCCL Error] Enable distributed expert feature


Hi,

I installed FastMoE using USE_NCCL=1 python setup.py install.

When I set “expert_dp_comm” to “dp”, training runs fine. But when I set “expert_dp_comm” to “none” (i.e., each worker serves several unique expert networks), the process fails with an NCCL error:

NCCL Error at /home/h/code_gpt/fmoe-package/cuda/moe_comm_kernel.cu:29 value 4

I’m looking forward to your help!

My environment: PyTorch 1.8, NCCL 2.8.3, CUDA 10.1
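
For context, error value 4 corresponds to ncclInvalidArgument in NCCL’s ncclResult_t, which usually means an invalid argument was passed to an NCCL call rather than indicating a build failure. The sketch below illustrates what the two expert_dp_comm settings mean; the FMoETransformerMLP and mark_parallel_comm names follow the FastMoE API as of this issue and may differ in your installed version, so treat it as illustrative only.

```python
# Illustrative sketch only: FMoETransformerMLP and mark_parallel_comm follow
# the FastMoE (fmoe) API as of early 2021 and may differ in your version.
import torch
import torch.distributed as dist
from fmoe.transformer import FMoETransformerMLP

# Distributed experts rely on NCCL all-to-all, so the process group must use
# the NCCL backend and FastMoE must be built with USE_NCCL=1.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

layer = FMoETransformerMLP(
    num_expert=12,                     # experts hosted on this worker
    d_model=768,
    d_hidden=1536,
    world_size=dist.get_world_size(),  # experts are spread over all workers
).cuda()

# "dp"  : expert weights are treated like ordinary data-parallel parameters
#         (replicated and all-reduced), so no cross-worker expert traffic.
# "none": each worker keeps its own unique experts, and tokens are exchanged
#         over NCCL; this is the setting that triggers the error above.
layer.mark_parallel_comm(expert_dp_comm="none")
```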

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
laekov commented, Apr 26, 2021

The distributed experts feature is enabled by default in fmoefy. You may want to double-check the place where you call the function. In our experiments we use NVIDIA V100 32GB GPUs, and 12 experts are placed on each GPU. In other words, our --num-expert is set to 12.
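
For readers trying to reproduce that setup, a hypothetical call could look like the sketch below; the num_experts keyword is an assumption based on the FastMoE Megatron adapter of that period, so check fmoe/megatron in your checkout before relying on it.

```python
# Hypothetical sketch: the num_experts keyword is an assumption based on
# FastMoE's Megatron adapter at the time; verify against your checkout.
from fmoe.megatron import fmoefy

def add_distributed_experts(megatron_model):
    # 12 experts per worker across 8 workers gives 8 * 12 = 96 experts in
    # total, matching the setup reported in the FastMoE paper
    # (--num-expert 12 passed to the Megatron training script).
    return fmoefy(megatron_model, num_experts=12)
```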

0 reactions
BinHeRunning commented, Apr 17, 2021

We built a Docker image with PyTorch 1.8.0, CUDA 10.2, and NCCL 2.7.8, and we have tested that it can be used directly to install FastMoE with the distributed expert feature.

It can be found on Docker Hub: co1lin/fastmoe:pytorch1.8.0-cuda10.2-cudnn7-nccl2708

Thanks for the docker image.

I installed FastMoE with USE_NCCL=1, but when I run GPT-2 (L12-H768, intermediate size 1536, top-2) on an 8-GPU machine, the largest expert number I can reach is 32, whereas 96 experts are reported in the FastMoE paper.

When I increase the expert number to 48 (batch size per GPU: 1), CUDA OOM occurs.

It seems that the distributed expert feature was not activated. Do you have any suggestions?
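
One way to narrow this down is to confirm that the NCCL-enabled extension was actually built and imported. A rough sanity check is sketched below; the fmoe_cuda module name is taken from what the FastMoE sources import and should be verified against your installation.

```python
# Rough sanity check, assuming the compiled extension is named fmoe_cuda as in
# the FastMoE sources; adjust the name if your build differs.
import torch
import torch.distributed as dist

print("NCCL available in this PyTorch build:", dist.is_nccl_available())
print("NCCL version reported by PyTorch:", torch.cuda.nccl.version())

try:
    import fmoe_cuda  # fails if FastMoE was built without its CUDA/NCCL extension
    print("FastMoE CUDA extension:", fmoe_cuda.__file__)
except ImportError as err:
    print("FastMoE CUDA/NCCL extension not available:", err)
```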
