
[NCCL Error] Enable distributed expert feature


Hi,

I installed FastMoE using USE_NCCL=1 python setup.py install.

When I set “expert_dp_comm” to “dp”, training runs fine. But when I set “expert_dp_comm” to “none” (i.e., each worker serves several unique expert networks), the process fails with an NCCL error:

NCCL Error at /home/h/code_gpt/fmoe-package/cuda/moe_comm_kernel.cu:29 value 4

I’m looking forward to your help!

My environment: PyTorch 1.8, NCCL 2.8.3, CUDA 10.1
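
For context, error value 4 corresponds to ncclInvalidArgument in NCCL’s ncclResult_t, which usually means an invalid argument was passed to an NCCL call rather than indicating a build failure. The sketch below illustrates what the two expert_dp_comm settings mean; the FMoETransformerMLP and mark_parallel_comm names follow the FastMoE API as of this issue and may differ in your installed version, so treat it as illustrative only.

```python
# Illustrative sketch only: FMoETransformerMLP and mark_parallel_comm follow
# the FastMoE (fmoe) API as of early 2021 and may differ in your version.
import torch
import torch.distributed as dist
from fmoe.transformer import FMoETransformerMLP

# Distributed experts rely on NCCL all-to-all, so the process group must use
# the NCCL backend and FastMoE must be built with USE_NCCL=1.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

layer = FMoETransformerMLP(
    num_expert=12,                     # experts hosted on this worker
    d_model=768,
    d_hidden=1536,
    world_size=dist.get_world_size(),  # experts are spread over all workers
).cuda()

# "dp"  : expert weights are treated like ordinary data-parallel parameters
#         (replicated and all-reduced), so no cross-worker expert traffic.
# "none": each worker keeps its own unique experts, and tokens are exchanged
#         over NCCL; this is the setting that triggers the error above.
layer.mark_parallel_comm(expert_dp_comm="none")
```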

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
laekov commented, Apr 26, 2021

The distributed experts feature is enabled by default in fmoefy. You may want to double-check the place where you call the function. In our experiments we use NVIDIA V100 32GB GPUs, and 12 experts are placed on each GPU. In other words, our --num-expert is set to 12.
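
For readers trying to reproduce that setup, a hypothetical call could look like the sketch below; the num_experts keyword is an assumption based on the FastMoE Megatron adapter of that period, so check fmoe/megatron in your checkout before relying on it.

```python
# Hypothetical sketch: the num_experts keyword is an assumption based on
# FastMoE's Megatron adapter at the time; verify against your checkout.
from fmoe.megatron import fmoefy

def add_distributed_experts(megatron_model):
    # 12 experts per worker across 8 workers gives 8 * 12 = 96 experts in
    # total, matching the setup reported in the FastMoE paper
    # (--num-expert 12 passed to the Megatron training script).
    return fmoefy(megatron_model, num_experts=12)
```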

0 reactions
BinHeRunning commented, Apr 17, 2021

We built a Docker image with PyTorch 1.8.0, CUDA 10.2, and NCCL 2.7.8, and we have tested that it can be used directly to install FastMoE with the distributed expert feature.

It can be found on Docker Hub: co1lin/fastmoe:pytorch1.8.0-cuda10.2-cudnn7-nccl2708

Thanks for the docker image.

I installed FastMoE with USE_NCCL=1, but when I run GPT-2 (L12-H768, intermediate size 1536, top-2) on an 8-GPU machine, the largest expert number I can reach is 32, whereas 96 experts are reported in the FastMoE paper.

When I increase the expert number to 48 (batch size per GPU: 1), CUDA OOM occurs.

It seems that the distributed expert feature was not activated. Do you have any suggestions?
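
One way to narrow this down is to confirm that the NCCL-enabled extension was actually built and imported. A rough sanity check is sketched below; the fmoe_cuda module name is taken from what the FastMoE sources import and should be verified against your installation.

```python
# Rough sanity check, assuming the compiled extension is named fmoe_cuda as in
# the FastMoE sources; adjust the name if your build differs.
import torch
import torch.distributed as dist

print("NCCL available in this PyTorch build:", dist.is_nccl_available())
print("NCCL version reported by PyTorch:", torch.cuda.nccl.version())

try:
    import fmoe_cuda  # fails if FastMoE was built without its CUDA/NCCL extension
    print("FastMoE CUDA extension:", fmoe_cuda.__file__)
except ImportError as err:
    print("FastMoE CUDA/NCCL extension not available:", err)
```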
