NCCL Asynchronous update timeout crash with Tutel MoE

Hi, I am using the Tutel library with the MMAction framework to replicate the Swin-v2 MoE performance described in the paper. However, I get the error below when I try to train the MoE model in a DDP setting. Can someone please help me resolve it? Alternatively, could you release the object detection code that was used in the Tutel paper?

[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLTOALL_BASE, Timeout(ms)=300000) ran for 306666 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

terminate called after throwing an instance of 'std::runtime_error'

  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLTOALL_BASE, Timeout(ms)=300000) ran for 306666 milliseconds before timing out.

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6056 closing signal SIGTERM

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
ghostplant commented, Sep 6, 2022

OK, since tutel.examples.helloworld works well, the timeout is most likely related to inequivalent data sources stored on each GPU: each rank plans a different number of local iterations and therefore calls the model's forward function a different number of times. A timeout of this kind has to be solved on the application side. You can still check whether enabling both of the following options gets rid of the problem: (1) set capacity_factor = negative_value in the moe_layer creation inside the transformer initialization function; (2) always call _moe_layer_0.forward(.., inequivalent_tokens=True) in the transformer forwarding function.

If the combination above doesn’t work, you have to change how data is fed on the application side to guarantee that all GPUs always have the same forwarding counts and execution order.
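
To make the "same forwarding counts" requirement concrete, here is a minimal pure-Python sketch, not taken from Tutel or MMAction. The helper names (steps_per_rank, common_step_count) and the shard sizes are hypothetical. It computes how many forward calls each rank would make in one epoch and the largest step count that every rank can complete. In a real DDP job the per-rank counts would have to be exchanged with a collective or fixed by the sampler, which this sketch only assumes.

```python
import math


def steps_per_rank(samples_per_rank, batch_size, drop_last=False):
    """Return the number of forward calls each rank makes in one epoch."""
    if drop_last:
        # Incomplete trailing batches are discarded.
        return [n // batch_size for n in samples_per_rank]
    # A trailing partial batch still costs one forward call.
    return [math.ceil(n / batch_size) for n in samples_per_rank]


def common_step_count(samples_per_rank, batch_size, drop_last=False):
    """Return the largest step count that every rank can complete."""
    return min(steps_per_rank(samples_per_rank, batch_size, drop_last))


# Hypothetical shard sizes: rank 0 holds 16 more samples than rank 1.
samples = [1632, 1616]
print(steps_per_rank(samples, batch_size=16))     # [102, 101]
print(common_step_count(samples, batch_size=16))  # 101
```

If every rank stops after common_step_count steps, no rank enters an extra all-to-all that its peers never join. The cost is that the surplus samples on the larger shard are skipped for that epoch.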

1 reaction
ghostplant commented, Sep 7, 2022

Since inequivalent_tokens=True works, the problem is not caused by “inequivalent forwarding counts” (see Case-1). The option only helps when, within an iteration, the “tokens per batch” on one device differs from that on the others (see Case-2).

Case-1: where inequivalent_tokens=True is NOT helpful

        [GPU-0]          [GPU-1]        [...]
     epoch0-step0      epoch0-step0
     epoch0-step1      epoch0-step1
         ...                ...
     epoch0-step100    epoch0-step100
     epoch0-step101    epoch1-step0     <--
     epoch1-step0      epoch1-step1
         ...                ...

Case-2: where inequivalent_tokens=True is helpful

        [GPU-0]          [GPU-1]
     step-0 (bs=16)    step-0 (bs=16)
     step-1 (bs=16)    step-1 (bs=16)
         ...                ...
     step-50 (bs=16)   step-50 (bs=16)
     step-51 (bs=3)    step-51 (bs=11)  <--
         ...                ...
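
The two cases can be told apart mechanically. The following is a rough sketch in plain Python, not part of Tutel. The function name diagnose and the example schedules are hypothetical, and the batch size is used as a stand-in for "tokens per batch". Each rank is described by the list of batch sizes it would process in one epoch.

```python
def diagnose(schedules):
    """Classify per-rank batch-size schedules for one epoch.

    schedules: one list of batch sizes per rank, in execution order.
    """
    # Case-1: ranks disagree on how many forward calls they make.
    if len({len(s) for s in schedules}) > 1:
        return "forward-count mismatch"
    # Case-2: same number of steps, but some step has unequal batch sizes.
    for sizes_at_step in zip(*schedules):
        if len(set(sizes_at_step)) > 1:
            return "token-count mismatch"
    return "consistent"


# Case-1: rank 0 runs 102 steps per epoch, rank 1 runs 101.
case_1 = [[16] * 102, [16] * 101]
# Case-2: both ranks run 52 steps, but the last batch differs (3 vs 11).
case_2 = [[16] * 51 + [3], [16] * 51 + [11]]

print(diagnose(case_1))  # forward-count mismatch
print(diagnose(case_2))  # token-count mismatch
```

Per the comment above, only a "token-count mismatch" is something inequivalent_tokens=True can help with. A "forward-count mismatch" has to be fixed in the data feeding.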