NCCL Asynchronous update timeout crash with Tutel MoE

Hi, I am using the Tutel library with the MMAction framework to replicate the Swin-v2 MoE performance described in the paper. However, I get the error below when I try to train the MoE model in a DDP setting. Can someone please help me resolve it? Alternatively, could you release the object detection code that was used in the Tutel paper?

[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLTOALL_BASE, Timeout(ms)=300000) ran for 306666 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

terminate called after throwing an instance of 'std::runtime_error'

  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLTOALL_BASE, Timeout(ms)=300000) ran for 306666 milliseconds before timing out.

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6056 closing signal SIGTERM

Top GitHub Comments

2 reactions

ghostplant commented, Sep 6, 2022

OK, since tutel.examples.helloworld works well, the timeout is most likely caused by unequal amounts of data stored on each GPU. That gives each rank a different locally planned iteration count, so the ranks call the model's forward function a different number of times. A timeout of this kind has to be solved on the application side. You can still check whether enabling both of the following options gets rid of the problem: (1) set capacity_factor to a negative value when creating moe_layer in the transformer's initialization function; (2) always call _moe_layer_0.forward(.., inequivalent_tokens=True) in the transformer's forward function.

If the combination above doesn’t work, you have to change how the application feeds data so that every GPU always runs the same number of forward passes in the same execution order.
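
To see what "same forwarding counts" means in practice, it can help to compute how many forward passes each rank would run per epoch and cap every rank at the smallest count. The sketch below is plain Python and is not part of Tutel or PyTorch; the function names and the per-rank sample counts are hypothetical and only illustrate the arithmetic.

```python
import math

def steps_per_rank(samples_per_rank, batch_size, drop_last=False):
    """Return how many forward passes each rank would run in one epoch."""
    if drop_last:
        # A trailing partial batch is discarded.
        return [n // batch_size for n in samples_per_rank]
    # A trailing partial batch still costs one forward pass.
    return [math.ceil(n / batch_size) for n in samples_per_rank]

def common_step_count(samples_per_rank, batch_size, drop_last=False):
    """Largest step count that every rank can reach without running dry."""
    return min(steps_per_rank(samples_per_rank, batch_size, drop_last))

# Hypothetical shard sizes: rank 0 holds one more full batch than rank 1.
samples = [1632, 1616]
planned = steps_per_rank(samples, batch_size=16)
safe_steps = common_step_count(samples, batch_size=16)

print("planned steps per rank:", planned)
print("steps every rank can run:", safe_steps)
```

In a real distributed job each rank only knows its own count, so the minimum would have to be agreed on across ranks (for example with a MIN all-reduce in torch.distributed) or avoided altogether by giving every rank an equally sized shard.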

1 reaction

ghostplant commented, Sep 7, 2022

Since inequivalent_tokens=True works, the problem is not caused by “inequivalent forwarding counts” (see Case-1). The option only helps when, within an iteration, the “tokens per batch” on one device differs from that on the other devices (see Case-2).

Case-1: where inequivalent_tokens=True is NOT helpful

        [GPU-0]          [GPU-1]        [...]
     epoch0-step0      epoch0-step0
     epoch0-step1      epoch0-step1
         ...                ...
     epoch0-step100    epoch0-step100
     epoch0-step101    epoch1-step0     <--
     epoch1-step0      epoch1-step1
         ...                ...

Case-2: where inequivalent_tokens=True is helpful

        [GPU-0]          [GPU-1]
     step-0 (bs=16)    step-0 (bs=16)
     step-1 (bs=16)    step-1 (bs=16)
         ...                ...
     step-50 (bs=16)   step-50 (bs=16)
     step-51 (bs=3)    step-51 (bs=11)  <--
         ...                ...
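
The two cases can be told apart before training by comparing the per-step batch sizes that each rank would produce. The following is a minimal sketch in plain Python, not Tutel code; the function names, the labels it returns, and the sample counts are hypothetical.

```python
def batch_sizes(num_samples, batch_size):
    """List the batch size of every step a rank would run in one epoch."""
    full, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    if remainder:
        sizes.append(remainder)  # trailing partial batch
    return sizes

def diagnose(samples_per_rank, batch_size):
    """Classify a per-rank data layout as 'case-1', 'case-2', or 'ok'."""
    schedules = [batch_sizes(n, batch_size) for n in samples_per_rank]
    if len({len(s) for s in schedules}) > 1:
        # Ranks run a different number of steps (Case-1).
        return "case-1"
    if any(len(set(step)) > 1 for step in zip(*schedules)):
        # Same number of steps, but some step has unequal batch sizes (Case-2).
        return "case-2"
    return "ok"

# Hypothetical sample counts chosen to reproduce the two diagrams above.
case1 = diagnose([1632, 1616], batch_size=16)  # 102 steps vs 101 steps
case2 = diagnose([819, 827], batch_size=16)    # last step: bs=3 vs bs=11
print(case1, case2)
```

A "case-1" result points at the data feeding on the application side, while "case-2" is the situation the comment above describes inequivalent_tokens=True as being helpful for.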