Error in load_importance_loss

Hi, I get an error when using load_importance_loss (the code works fine when using gshard_loss). Does anyone have an idea what causes it?

The error log from one rank/node is below:

[4]:
  time      : 2022-07-06_11:47:24
  host      : SG-IDC1-10-51-2-36
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 55010)
  error_file: /tmp/torchelastic_kuhg0qco/none_62gucqgc/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
      return forward_call(*input, **kwargs)
    File "/mnt/lustre/bli/projects/Pretraining-DG/mae/models_moe_mae.py", line 75, in forward
      x_temp = self.mlp(self.norm2(x))
    File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
      return forward_call(*input, **kwargs)
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 231, in forward
      logits_dtype, (crit, l_aux) = routing()
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 218, in routing
      return logits.dtype, extract_critical(scores,
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/fast_dispatch.py", line 150, in extract_critical
      l_loss = loss_fn(scores, topk_indices) if loss_fn is not None else None
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 215, in <lambda>
      _loss_fn = lambda gates, topk_ids: losses.load_importance_loss(
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/losses.py", line 41, in load_importance_loss
      l_load = load_loss(scores_wo_noise, topk_logits, num_global_experts, gate_noise)
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/losses.py", line 23, in load_loss
      normal = Normal(
    File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/distributions/normal.py", line 54, in __init__
      super(Normal, self).__init__(batch_shape, validate_args=validate_args)
    File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/distributions/distribution.py", line 55, in __init__
      raise ValueError(
  ValueError: Expected parameter scale (Tensor of shape (1,)) of distribution Normal(loc: tensor([0.], device='cuda:4'), scale: tensor([0.], device='cuda:4')) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
  tensor([0.], device='cuda:4')

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
Luodian commented, Aug 1, 2022

Yep, and I also found an issue when using the cosine projector.

It seems that in cosine_top.py, line 31, there should be a .cuda() or .to(device) call to make sure the tensors are on the same device.

logit_scale = torch.clamp(self.temperature, max=torch.log(torch.tensor(1. / 0.01)).cuda()).exp()
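One way to sidestep the device mismatch altogether, sketched here as an alternative and not as the fix that was committed: `torch.clamp` also accepts a plain Python number for `max`, so the ceiling can be computed with `math.log` and never lives on any device. The `temperature` variable below is a stand-in for `self.temperature`, and `max_log_scale` is a name introduced only for this illustration.

```python
import math
import torch

# Stand-in for self.temperature; in the real module this is a learnable
# parameter that may live on any device.
temperature = torch.nn.Parameter(torch.tensor(10.0))

# A Python float carries no device, so the clamp works wherever
# `temperature` happens to live.
max_log_scale = math.log(1.0 / 0.01)
logit_scale = torch.clamp(temperature, max=max_log_scale).exp()
```

Because the bound is a scalar, the result stays on the same device as `temperature` without any explicit cast.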
0 reactions
ghostplant commented, Aug 1, 2022

We have added a gate_noise assertion and a device cast in the latest commit. Thanks for pointing out this bug.
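The traceback in the question shows `Normal` being built with a `scale` of 0, which violates its `GreaterThan(lower_bound=0.0)` constraint. Since `load_loss` receives `gate_noise`, a zero noise setting appears to be the trigger. A minimal sketch of the kind of early check described above, assuming a hypothetical helper named `check_gate_noise` (this is not Tutel's actual code):

```python
def check_gate_noise(gate_noise: float) -> float:
    """Hypothetical guard: reject noise values that cannot be a Normal scale."""
    # A Normal distribution needs scale > 0, so zero, negative, or NaN noise
    # is rejected here with a readable message instead of failing deep
    # inside the loss computation.
    if not gate_noise > 0:
        raise ValueError(
            f"load_importance_loss needs gate_noise > 0, got {gate_noise}"
        )
    return gate_noise
```

Calling such a check before the loss is computed turns the distribution-level `ValueError` into a message that points at the setting to change.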
