Bug report

There’s a bug when using the attention layer. In this line: https://github.com/facebookresearch/ParlAI/blob/55fcf6127309f3c0e2f15c1fe6eae1fd71afcbcb/parlai/agents/seq2seq/modules.py#L80 new hidden states are returned, but they are never used to compute the next prediction. This is why the attention model performs extremely badly. Here’s the result after just 30 minutes of training:

TEXT:  I get to read the articles of extradition acordind to the European Court of human rights .
PREDICTION:  i was just a little bit of a lot of people .
~
TEXT:  Yes , you are the very monster I created
PREDICTION:  i will be a good thing
~
TEXT:  Hello , detective Spooner .
PREDICTION:  i don' t know .
~
TEXT:  I' m a tiger .
PREDICTION:  i don' t know .
~
TEXT:  What' ve you got ?
PREDICTION:  i don' t know .
~
TEXT:  We are going to change the way we see the road .
PREDICTION:  i don' t know what you are .

What’s more, an attention model (using local for Twitter and general for OpenSubtitles) really can lower the loss.
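
The kind of bug described above (a step returns a new hidden state, but the loop never feeds it forward) can be sketched in a few lines. This is a toy illustration in plain Python, not ParlAI's code: `toy_decoder_step`, `decode_buggy`, and `decode_fixed` are hypothetical names, and the arithmetic merely stands in for a real recurrent update.

```python
def toy_decoder_step(token, hidden):
    """Hypothetical stand-in for one decoder step: returns (output, new_hidden)."""
    new_hidden = hidden + token   # the state accumulates what has been seen so far
    output = new_hidden * 2       # the prediction depends on the updated state
    return output, new_hidden


def decode_buggy(tokens, hidden=0):
    outputs = []
    for token in tokens:
        # Bug: the returned hidden state is discarded, so every step
        # starts again from the same initial state.
        output, _ = toy_decoder_step(token, hidden)
        outputs.append(output)
    return outputs


def decode_fixed(tokens, hidden=0):
    outputs = []
    for token in tokens:
        # Fix: carry the returned hidden state into the next step.
        output, hidden = toy_decoder_step(token, hidden)
        outputs.append(output)
    return outputs


tokens = [1, 2, 3]
buggy = decode_buggy(tokens)   # [2, 4, 6]: each output ignores earlier tokens
fixed = decode_fixed(tokens)   # [2, 6, 12]: each output reflects the history
```

In the buggy loop every prediction is conditioned on the initial state only, which would be consistent with a model that falls back on generic replies such as the ones shown above.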

The default value of lookuptable https://github.com/facebookresearch/ParlAI/blob/55fcf6127309f3c0e2f15c1fe6eae1fd71afcbcb/parlai/agents/seq2seq/seq2seq.py#L107 causes much higher memory usage, and I couldn’t find the reason. The old value, all, works fine.
In this line of the vectorize() function, https://github.com/facebookresearch/ParlAI/blob/55fcf6127309f3c0e2f15c1fe6eae1fd71afcbcb/parlai/agents/seq2seq/seq2seq.py#L403 only 6 values are returned, but the newer version needs 7.
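
For the return-count mismatch, here is a minimal sketch of one way it could surface, assuming the caller unpacks the returned tuple directly (the issue does not say how the failure appears). `old_vectorize` and `caller_expecting_seven` are hypothetical stand-ins, not ParlAI functions.

```python
def old_vectorize():
    """Hypothetical stand-in for a function that still returns 6 values."""
    return 1, 2, 3, 4, 5, 6


def caller_expecting_seven():
    # The newer call site unpacks 7 names, so a 6-tuple cannot satisfy it.
    a, b, c, d, e, f, g = old_vectorize()
    return g


try:
    caller_expecting_seven()
    error_message = None
except ValueError as exc:
    # Python raises ValueError when the tuple length and target count differ.
    error_message = str(exc)
```

Under that assumption the failure is immediate and loud, so it is easy to spot once the affected code path actually runs.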

Issue Analytics

Created 6 years ago
Comments: 6 (6 by maintainers)

Top GitHub Comments

alexholdenmiller commented, Feb 26, 2018

wow, thanks. that was some copypasta from the line above it but was really hurting training with attention. thanks for the catch.
unique uses 3x more memory than all, intentionally. all shares the same tensor for the weight of the encoder Embedding layer, the decoder Embedding layer, and the final Linear layer that produces an output token. unique keeps all three separate, while enc_dec and dec_out each share only the pair they name.
fixing, thanks.
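
The memory difference between the lookuptable settings can be illustrated with a toy model of weight sharing. This is only a sketch in plain Python, not ParlAI's implementation: nested lists stand in for weight tensors, `build_weights` and `count_stored_values` are hypothetical helpers, and the reading of enc_dec (encoder and decoder embeddings shared) and dec_out (decoder embedding and output layer shared) is an assumption based on the names and the comment above.

```python
def build_weights(lookuptable, vocab_size=4, emb_size=3):
    """Return the three weight matrices, shared according to `lookuptable`."""
    def new_matrix():
        return [[0.0] * emb_size for _ in range(vocab_size)]

    enc = new_matrix()
    # Encoder and decoder embeddings share storage in 'all' and 'enc_dec'.
    dec = enc if lookuptable in ("all", "enc_dec") else new_matrix()
    # Decoder embedding and output layer share storage in 'all' and 'dec_out'.
    out = dec if lookuptable in ("all", "dec_out") else new_matrix()
    return {"encoder_embedding": enc, "decoder_embedding": dec, "output_linear": out}


def count_stored_values(weights):
    """Count values actually stored, counting each shared matrix once."""
    unique = {id(matrix): matrix for matrix in weights.values()}
    return sum(len(row) for matrix in unique.values() for row in matrix)


sizes = {
    mode: count_stored_values(build_weights(mode))
    for mode in ("all", "enc_dec", "dec_out", "unique")
}
# sizes == {'all': 12, 'enc_dec': 24, 'dec_out': 24, 'unique': 36}
```

With a vocabulary-sized matrix stored once for all and three times for unique, the 3x figure follows directly, and the two pairwise modes land in between.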