
Error met when using multi nodes

See original GitHub issue

Dear contributors, I am hitting an error with Tutel's MoE layer. The error occurs when I run tutel/examples/helloworld_ddp.py in torch distributed mode with more than one GPU node (i.e., 16 GPUs across 2 machines). However, the script runs fine with 8 GPUs or fewer.
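For reference, a two-node launch of this example typically looks something like the sketch below, assuming the standard torch.distributed.launch utility is used; the master address, port, and per-node GPU count are placeholders for illustration, not values taken from the issue:

# Node 0 (node_rank=0): 8 GPUs per node, 2 nodes = 16 GPUs total
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr=10.0.0.1 --master_port=29500 \
    tutel/examples/helloworld_ddp.py

# Node 1: the same command, with --node_rank=1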

The error log is as follows:


[Benchmark] world_size = 16, dtype = float32, model_dim = 2048, hidden_size = 2048, samples = 65536, num_local_experts = 2, topK = 1, device = `cuda:0`
Traceback (most recent call last):
  File "tutel/examples/helloworld_ddp.py", line 154, in <module>
    output = model(x)
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "tutel/examples/helloworld_ddp.py", line 119, in forward
    result = self._moe_layer(input)
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/moe_layer.py", line 387, in forward
    result_output, l_aux = self.gate.apply_on_expert_fn(reshaped_input, self.expert_fn, self.group)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/moe_layer.py", line 103, in apply_on_expert_fn
    locations1 = self.compute_location(masks_se[0])
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/jit_kernels/gating.py", line 81, in fast_cumsum_sub_one
    return get_cumsum_kernel(data.size(0), data.size(1))(data)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
    base_kernel(mask1.to(torch.int32).contiguous(), locations1)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/jit_compiler.py", line 39, in func
    tutel_custom_kernel.invoke_with_source(inputs, __ctx__, no_nvrtc, source)
RuntimeError: (0) == (cuModuleGetFunction(&gm.hFunc, gm.hMod, std::string(pos, tail - pos).c_str())) INTERNAL ASSERT FAILED at "/mnt/lustre/zhujinguo/codes/tutel/tutel/custom/custom_kernel.cpp":208, please report a bug to PyTorch. CHECK_EQ fails.

The same error also occurs in another project where I use the Tutel MoE layer.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
Lechatelia commented, Dec 23, 2021

Thank you! I have solved this problem by upgrading to torch 1.10 + CUDA 11.1.

Setting `export NO_NVRTC=0` is also needed.
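In shell terms, that fix amounts to something like the sketch below. Only the torch/CUDA versions and NO_NVRTC=0 come from this comment; the pip wheel spec, launch flags, and master address are assumptions for illustration:

# One way to get torch 1.10 built against CUDA 11.1 in a pip-based environment
pip install torch==1.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

# Export the variable on every node before launching the example
export NO_NVRTC=0
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr=10.0.0.1 --master_port=29500 \
    tutel/examples/helloworld_ddp.py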

0 reactions
ghostplant commented, Dec 23, 2021

Thank you for the information. I'll close this issue since it is fixed.

