NCCL Asynchronous update timeout crash with Tutel MoE
Hi, I am using the Tutel library with the MMAction framework to replicate the Swin-v2 MoE performance described in the paper. However, I am facing this error when I try to train the MoE in a DDP setting. Can someone please help me resolve this error? Alternatively, could you release the object detection code that was used in the Tutel paper?
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLTOALL_BASE, Timeout(ms)=300000) ran for 306666 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLTOALL_BASE, Timeout(ms)=300000) ran for 306666 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6056 closing signal SIGTERM
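For context, the 300000 ms limit in the log is the collective timeout the NCCL process group was initialized with (PyTorch's own default is 30 minutes, so the five-minute value here presumably comes from the launcher or framework config). A minimal sketch of where that value lives follows; raising it only postpones the watchdog if one rank genuinely never reaches the all-to-all, so treat it as a debugging aid rather than a fix.

```python
import datetime
import torch.distributed as dist

# Sketch only: the timeout mirrors the Timeout(ms)=300000 seen in the log.
# With the NCCL backend, any collective (such as the all_to_all behind Tutel's
# MoE dispatch) that some rank never enters will trip this watchdog on the
# ranks that did enter it.
dist.init_process_group(
    backend='nccl',
    timeout=datetime.timedelta(minutes=5),  # raise temporarily while debugging
)
```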
OK, since tutel.examples.helloworld works well, it should be related to inequivalent data sources stored on each GPU, which results in different planned iteration counts locally and thus a different number of model forward calls per device. So such a timeout has to be solved on the application side. But you can still try whether enabling both of the following options gets rid of this problem: (1) setting capacity_factor = negative_value inside the moe_layer creation in the transformer initialization function; (2) always calling _moe_layer_0.forward(.., inequivalent_tokens=True) in the transformer forward function. If the combination above doesn't work, you have to change the way data is fed on the application side to guarantee that all GPUs always have the same forward counts and execution order.
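A rough sketch of where those two changes land, assuming a Tutel-style MoE transformer block: everything besides capacity_factor and inequivalent_tokens is a placeholder modeled on tutel.examples.helloworld (dimensions, expert counts, the activation, and even where capacity_factor is accepted may differ across Tutel versions), so adapt it to the actual Swin-v2 MoE configuration.

```python
import torch
import torch.nn.functional as F
from tutel import moe as tutel_moe

class MoEBlock(torch.nn.Module):
    def __init__(self, model_dim=768, num_local_experts=2, hidden_size=3072):
        super().__init__()
        # (1) capacity_factor set to a negative value at moe_layer creation.
        #     All other arguments are placeholders following tutel.examples.helloworld.
        self._moe_layer_0 = tutel_moe.moe_layer(
            gate_type={'type': 'top', 'k': 2, 'capacity_factor': -1.25},
            experts={'type': 'ffn',
                     'count_per_node': num_local_experts,
                     'hidden_size_per_expert': hidden_size,
                     'activation_fn': lambda x: F.gelu(x)},
            model_dim=model_dim,
            scan_expert_func=lambda name, p: setattr(p, 'skip_allreduce', True),
        )

    def forward(self, x):
        # (2) always pass inequivalent_tokens=True in the forward path.
        return self._moe_layer_0(x, inequivalent_tokens=True)
```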
Since inequivalent_tokens=True works, it means there is no issue from "inequivalent forwarding counts" (see Case-1). That option is only helpful when, for each iteration, the "tokens per batch" on a device is not the same as on the other devices (see Case-2).

Case-1: where inequivalent_tokens=True is NOT helpful
Case-2: where inequivalent_tokens=True is helpful
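One way to tell which case applies, assuming a standard torch.distributed setup (the helper name and the loader/token variables below are illustrative, not from the thread): gather each rank's planned iteration count and per-batch token count and compare them.

```python
import torch
import torch.distributed as dist

def check_rank_consistency(loader, tokens_in_batch):
    """Gather per-rank stats to distinguish Case-1 from Case-2.

    Differing iteration counts -> Case-1: the data feeding must be changed so
                                  every rank runs the same number of forward passes.
    Differing tokens per batch -> Case-2: inequivalent_tokens=True (with a
                                  negative capacity_factor) is the relevant knob.
    """
    local = torch.tensor([len(loader), tokens_in_batch],
                         dtype=torch.long, device='cuda')
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    if dist.get_rank() == 0:
        print('iterations per rank:', [int(t[0]) for t in gathered])  # must all match
        print('tokens per batch:   ', [int(t[1]) for t in gathered])  # may legitimately differ
```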