
Error met when using multi nodes

See original GitHub issue

Dear contributors, I am hitting an error with Tutel's MoE layer. The error occurs when I run tutel/examples/helloworld_ddp.py in torch distributed mode with more than one GPU node (i.e., 16 GPUs across 2 machines). However, the script runs fine with 8 GPUs or fewer.
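For reference, a two-node launch of this example typically looks something like the sketch below, assuming the standard torch.distributed.launch utility is used; the master address, port, and per-node GPU count are placeholders for illustration, not values taken from the issue:

# Node 0 (node_rank=0): 8 GPUs per node, 2 nodes = 16 GPUs total
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr=10.0.0.1 --master_port=29500 \
    tutel/examples/helloworld_ddp.py

# Node 1: the same command, with --node_rank=1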

The error log is as follows:


[Benchmark] world_size = 16, dtype = float32, model_dim = 2048, hidden_size = 2048, samples = 65536, num_local_experts = 2, topK = 1, device = `cuda:0`
Traceback (most recent call last):
  File "tutel/examples/helloworld_ddp.py", line 154, in <module>
    output = model(x)
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "tutel/examples/helloworld_ddp.py", line 119, in forward
    result = self._moe_layer(input)
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/moe_layer.py", line 387, in forward
    result_output, l_aux = self.gate.apply_on_expert_fn(reshaped_input, self.expert_fn, self.group)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/moe_layer.py", line 103, in apply_on_expert_fn
    locations1 = self.compute_location(masks_se[0])
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/jit_kernels/gating.py", line 81, in fast_cumsum_sub_one
    return get_cumsum_kernel(data.size(0), data.size(1))(data)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
    base_kernel(mask1.to(torch.int32).contiguous(), locations1)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/jit_compiler.py", line 39, in func
    tutel_custom_kernel.invoke_with_source(inputs, __ctx__, no_nvrtc, source)
RuntimeError: (0) == (cuModuleGetFunction(&gm.hFunc, gm.hMod, std::string(pos, tail - pos).c_str())) INTERNAL ASSERT FAILED at "/mnt/lustre/zhujinguo/codes/tutel/tutel/custom/custom_kernel.cpp":208, please report a bug to PyTorch. CHECK_EQ fails.

The same error also occurs in another project where I use the Tutel MoE layer.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
Lechatelia commented, Dec 23, 2021

Thank you! I have solved this problem by upgrading to torch 1.10 + CUDA 11.1.

Setting `export NO_NVRTC=0` is also needed.
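In shell terms, that fix amounts to something like the sketch below. Only the torch/CUDA versions and NO_NVRTC=0 come from this comment; the pip wheel spec, launch flags, and master address are assumptions for illustration:

# One way to get torch 1.10 built against CUDA 11.1 in a pip-based environment
pip install torch==1.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

# Export the variable on every node before launching the example
export NO_NVRTC=0
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr=10.0.0.1 --master_port=29500 \
    tutel/examples/helloworld_ddp.py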

0 reactions
ghostplant commented, Dec 23, 2021

Thank you for the information. I'll close this issue since it is fixed.

