Error in load_importance_loss

Hi, I hit the error below when using load_importance_loss (the code works fine with gshard_loss). Does anyone have an idea what causes it? The error log from one rank/node is as follows:
[4]:
time : 2022-07-06_11:47:24
host : SG-IDC1-10-51-2-36
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 55010)
error_file: /tmp/torchelastic_kuhg0qco/none_62gucqgc/attempt_0/4/error.json
traceback : Traceback (most recent call last):
File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/lustre/bli/projects/Pretraining-DG/mae/models_moe_mae.py", line 75, in forward
x_temp = self.mlp(self.norm2(x))
File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 231, in forward
logits_dtype, (crit, l_aux) = routing()
File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 218, in routing
return logits.dtype, extract_critical(scores,
File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/fast_dispatch.py", line 150, in extract_critical
l_loss = loss_fn(scores, topk_indices) if loss_fn is not None else None
File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 215, in <lambda>
_loss_fn = lambda gates, topk_ids: losses.load_importance_loss(
File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/losses.py", line 41, in load_importance_loss
l_load = load_loss(scores_wo_noise, topk_logits, num_global_experts, gate_noise)
File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/losses.py", line 23, in load_loss
normal = Normal(
File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/distributions/normal.py", line 54, in __init__
super(Normal, self).__init__(batch_shape, validate_args=validate_args)
File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/distributions/distribution.py", line 55, in __init__
raise ValueError(
ValueError: Expected parameter scale (Tensor of shape (1,)) of distribution Normal(loc: tensor([0.], device='cuda:4'), scale: tensor([0.], device='cuda:4')) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([0.], device='cuda:4')
Yep, and I also found an issue when using the cosine projector. It seems that in cosine_top.py, line 31, there should be a .cuda() or .to(device) cast to make sure the tensors are on the same device.

We have added a gate_noise assertion and a device cast in the latest commit. Thanks for pointing out this bug.