Error when using multiple nodes
Dear contributors,
I am encountering an error with tutel's MoE layer.
The error occurs when I run tutel/examples/helloworld_ddp.py
in torch distributed mode with more than one GPU node (e.g. 16 GPUs across 2 machines).
However, the script runs fine with 8 GPUs or fewer.
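For context, a minimal sketch of the multi-node setup I mean (assuming a launch with torchrun or torch.distributed.launch --use_env, which provide LOCAL_RANK; this is not copied from the example script):

```python
# Sketch of the multi-node setup: 16 processes in total, 8 per node, one per GPU.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")            # rendezvous via the launcher's env vars
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by torchrun / --use_env
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()} / world size {dist.get_world_size()} ready")
```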
The error log is as follows:
[Benchmark] world_size = 16, dtype = float32, model_dim = 2048, hidden_size = 2048, samples = 65536, num_local_experts = 2, topK = 1, device = `cuda:0`
Traceback (most recent call last):
  File "tutel/examples/helloworld_ddp.py", line 154, in <module>
    output = model(x)
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "tutel/examples/helloworld_ddp.py", line 119, in forward
    result = self._moe_layer(input)
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/moe_layer.py", line 387, in forward
    result_output, l_aux = self.gate.apply_on_expert_fn(reshaped_input, self.expert_fn, self.group)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/moe_layer.py", line 103, in apply_on_expert_fn
    locations1 = self.compute_location(masks_se[0])
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/jit_kernels/gating.py", line 81, in fast_cumsum_sub_one
    return get_cumsum_kernel(data.size(0), data.size(1))(data)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
    base_kernel(mask1.to(torch.int32).contiguous(), locations1)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/jit_compiler.py", line 39, in func
    tutel_custom_kernel.invoke_with_source(inputs, __ctx__, no_nvrtc, source)
RuntimeError: (0) == (cuModuleGetFunction(&gm.hFunc, gm.hMod, std::string(pos, tail - pos).c_str())) INTERNAL ASSERT FAILED at "/mnt/lustre/zhujinguo/codes/tutel/tutel/custom/custom_kernel.cpp":208, please report a bug to PyTorch. CHECK_EQ fails.
I also use the tutel MoE layer in another project, where the same error occurred.
Thank you! I have solved this problem by upgrading to torch 1.10 + CUDA 11.1. And export NO_NVRTC=0 is also needed.
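For anyone hitting the same assertion, a minimal sketch of applying the same workaround from a Python entry point instead of the shell (the variable name NO_NVRTC is taken from the comment above; setting it before the tutel import is my assumption, not something verified against tutel's sources):

```python
import os

# Mirror `export NO_NVRTC=0`: keep the NVRTC JIT path enabled.
# Set it before importing tutel so the custom-kernel loader can see it.
os.environ.setdefault("NO_NVRTC", "0")

from tutel import moe as tutel_moe  # import path assumed from the tutel examples
```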
Thank you for your information. I'll just close this issue now that it is fixed.