
The discard mechanism of the MoE layer seems wrong.

See original GitHub issue

I am using deepspeed.moe.layer.MoE. The most concerning part is the dropped inputs. Some inputs are dropped so that the experts are not fed too much data. But when the outputs are generated in this line, I notice that the positions holding the discarded inputs are all zeros, because the corresponding combined_weights are zero. Shouldn't those positions hold the directly forwarded inputs instead?

To reproduce this, run the CIFAR example and print the intermediate outputs of the MoE layer. When the inputs are dispatched unevenly, you will see all-zero vectors in the outputs.
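For intuition, here is a minimal, self-contained sketch (not the DeepSpeed source) of why capacity-based token dropping produces all-zero rows: a dropped token gets a zero combine-weight row, so the weighted recombination over experts yields a zero vector at its position. The function `toy_moe_forward`, the `capacity` argument, and the identity "experts" are illustrative assumptions, not DeepSpeed APIs.

```python
import torch

def toy_moe_forward(x, gate_idx, num_experts=2, capacity=2):
    """Toy top-1 routing with a per-expert capacity; illustrative only.

    x:        [tokens, d] input vectors
    gate_idx: [tokens] chosen expert index per token
    """
    tokens, d = x.shape
    combine_weights = torch.zeros(tokens, num_experts)
    slots_used = [0] * num_experts
    for t in range(tokens):
        e = int(gate_idx[t])
        if slots_used[e] < capacity:        # token fits in this expert's buffer
            combine_weights[t, e] = 1.0     # gate probability omitted for clarity
            slots_used[e] += 1
        # else: token is dropped -> its combine-weight row stays all zero

    # Identity "experts" for illustration: each expert just echoes its tokens.
    expert_out = torch.stack([x for _ in range(num_experts)])    # [E, tokens, d]
    # Weighted recombination: dropped tokens end up as zero vectors.
    return torch.einsum("te,ted->td", combine_weights,
                        expert_out.permute(1, 0, 2))

x = torch.randn(5, 4)
gate_idx = torch.tensor([0, 0, 0, 1, 1])    # expert 0 receives 3 tokens, capacity 2
y = toy_moe_forward(x, gate_idx)
print(y[2])                                  # the over-capacity token -> all zeros
```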

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
hobbitlzy commented, Oct 16, 2021

Thanks, @awan-10 @ykim362. I previously followed the implementation from this repo. It is also a nice implementation, but it passes the dropped tokens through. Anyway, I think you are right: Switch Transformer and GShard recover the dropped tokens via the residual connection.
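For clarity, here is a hedged sketch of that recovery path: if the MoE layer outputs a zero vector for a dropped token, wrapping it in a residual (skip) connection, as Switch Transformer / GShard-style blocks do, makes the block fall back to the identity for that token instead of zeroing it out. `MoEBlockWithResidual` and its arguments are illustrative names, and `moe_layer` stands in for any module that zeroes dropped positions; this is not DeepSpeed's API.

```python
import torch
import torch.nn as nn

class MoEBlockWithResidual(nn.Module):
    """Illustrative wrapper: x + MoE(LayerNorm(x)), in the spirit of
    Switch Transformer / GShard blocks. Not DeepSpeed's API."""

    def __init__(self, moe_layer, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.moe = moe_layer  # any callable returning a tensor shaped like x

    def forward(self, x):
        # A dropped token contributes a zero vector from self.moe, so the
        # block degrades to the identity for that token instead of
        # producing an all-zero output.
        return x + self.moe(self.norm(x))

# Usage with a stand-in expert layer (hypothetical):
# block = MoEBlockWithResidual(my_moe_layer, d_model=512)
# y = block(x)  # dropped tokens pass through unchanged via the skip path
```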

I notice you have achieved results similar to Switch Transformer in this article. I have tried to integrate DeepSpeed's MoE module with the Hugging Face training scripts, but I cannot reach the performance reported in the paper. In fact, my results are the opposite: the more experts, the lower the sample efficiency during training. I found one MoE example on CIFAR in DeepSpeedExamples. Could you also share the scripts to reproduce the Switch Transformer results?

0 reactions
awan-10 commented, Aug 16, 2022

Closing this due to long inactivity. @hobbitlzy – please reopen if you have more updates on this.

Read more comments on GitHub >

Top Results From Across the Web

Dense-to-Sparse Gate for Mixture-of-Experts | OpenReview
This paper proposes DTS, a simple yet effective training mechanism, which activates experts from dense to sparse for MoE models. Although this mechanism...
Read more >
Transformer Feed-forward Layers are Mixtures of Experts
We can study MoEfied models to interpret the inner mechanism of FFNs at a fine-grained level. In this work, we study...
Read more >
EvoMoE: An Evolutional Mixture-of-Experts Training ... - arXiv
Different from previous TopK-based gates, we propose the content-based gating mechanism, which activates experts whose weight is beyond...
Read more >
Print Settings - Slic3r Manual
For the bottom layers the important factor to consider is how the surface will look should there be a mistake whilst laying down...
Read more >
Efficient implementation of Mixture of Expert Layer in Pytorch
There is also a gating layer G_i(x_i) which is basically an attention mechanism over all sub-expert layers: sum(G_i(x_i) * F_i(x_i)).
Read more >
