Error in load_importance_loss

Hi, I hit the error below when using load_importance_loss (the code works fine with gshard_loss). Does anyone have an idea what causes it? The error log from one rank/node is as follows:
[4]:
time : 2022-07-06_11:47:24
host : SG-IDC1-10-51-2-36
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 55010)
error_file: /tmp/torchelastic_kuhg0qco/none_62gucqgc/attempt_0/4/error.json
traceback : Traceback (most recent call last):
File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/lustre/bli/projects/Pretraining-DG/mae/models_moe_mae.py", line 75, in forward
x_temp = self.mlp(self.norm2(x))
File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 231, in forward
logits_dtype, (crit, l_aux) = routing()
File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 218, in routing
return logits.dtype, extract_critical(scores,
File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/fast_dispatch.py", line 150, in extract_critical
l_loss = loss_fn(scores, topk_indices) if loss_fn is not None else None
File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 215, in <lambda>
_loss_fn = lambda gates, topk_ids: losses.load_importance_loss(
File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/losses.py", line 41, in load_importance_loss
l_load = load_loss(scores_wo_noise, topk_logits, num_global_experts, gate_noise)
File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/losses.py", line 23, in load_loss
normal = Normal(
File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/distributions/normal.py", line 54, in __init__
super(Normal, self).__init__(batch_shape, validate_args=validate_args)
File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/distributions/distribution.py", line 55, in __init__
raise ValueError(
ValueError: Expected parameter scale (Tensor of shape (1,)) of distribution Normal(loc: tensor([0.], device='cuda:4'), scale: tensor([0.], device='cuda:4')) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([0.], device='cuda:4')
Yep, and I also found an issue when using the cosine projector. It seems that in cosine_top.py, line 31, there should be a .cuda() or .to(device) cast to make sure the tensors are on the same device.

We have added a gate_noise assertion and a device cast in the latest commit. Thanks for pointing out this bug.