Dropout in embedding layer
In this paper, the authors state that applying dropout to the input of an embedding layer by selectively dropping certain ids is an effective way to prevent overfitting. For example, if the embedding is a word2vec embedding, this kind of dropout might drop the word “the” from the entire input sequence. In that case, the input “the dog and the cat” would become “-- dog and -- cat”; it would never become “-- dog and the cat”. This prevents the model from depending too heavily on specific words.
Although Keras currently allows applying dropout to the output vectors of an embedding layer, as far as I can tell from the documentation it does not allow applying dropout selectively to certain ids. Since embeddings are widely used, and the paper above states that they are prone to overfitting, this seems like a feature that would be useful to a relatively wide range of users. The expected API would be something like:
```python
from keras.layers import Embedding

embedding = Embedding(x, y, dropout=0.2)
```
where the dropout rate is the fraction of ids to drop. Would this be a worthwhile feature to add? Or is there a relatively obvious way to implement this functionality already?
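For illustration, a rough sketch of what such id-level dropout could look like with existing building blocks, assuming Keras 2 with the TensorFlow backend; `drop_word_ids`, the vocabulary size, and the rate are made-up names and values, not an existing Keras API:

```python
from keras import backend as K
from keras.layers import Lambda

def drop_word_ids(ids, vocab_size, rate):
    """During training, maps a random `rate` fraction of word *types* to id 0
    (e.g. the padding/UNK id), so every occurrence of a dropped word
    disappears from the whole batch."""
    keep = K.random_binomial((vocab_size,), p=1.0 - rate)   # one keep/drop decision per word type
    dropped = K.cast(K.cast(ids, 'float32') * K.gather(keep, ids), 'int32')
    return K.in_train_phase(dropped, ids)

# Applied in front of the Embedding layer:
word_dropout = Lambda(lambda t: drop_word_ids(t, vocab_size=20000, rate=0.2))
```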
I was also trying to find a solution for (word) embedding dropout.
The Dropout specification says: "noise_shape: 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input. For instance, if your inputs have shape (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features)." See also the SpatialDropout1D implementation here (https://github.com/keras-team/keras/blob/master/keras/layers/core.py), which actually uses the mask mentioned above.
So SpatialDropout1D performs variational dropout, at least for NLP-style models. We tested both Dropout(noise_shape=(batch_size, 1, features)) and SpatialDropout1D(), and as we expected, they both apply variational dropout (https://arxiv.org/pdf/1512.05287.pdf). So, if you need to drop a full word type (embedding), you have to use noise_shape=(batch_size, sequence_size, 1) in the Dropout layer, or you can create a new layer based on the SpatialDropout1D paradigm, like the sketch below. Please let me know if you find this helpful and, most importantly, correct in terms of the expected behaviour. @keitakurita @riadsouissi
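A minimal sketch of such a layer, assuming Keras 2.x, where SpatialDropout1D itself is implemented by overriding Dropout._get_noise_shape; the name WordDropout1D is just illustrative:

```python
from keras import backend as K
from keras.layers import Dropout

class WordDropout1D(Dropout):
    """Drops entire timesteps (whole word vectors) instead of single
    embedding dimensions, by broadcasting the mask over the feature axis."""

    def _get_noise_shape(self, inputs):
        input_shape = K.shape(inputs)                 # (batch_size, timesteps, features)
        return (input_shape[0], input_shape[1], 1)
```

This behaves like Dropout(rate, noise_shape=(batch_size, sequence_size, 1)), but infers the shape at call time instead of requiring it up front.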
@JohnGiorgi This code does not zero out word types per sequence as the authors claim, which means that for:
S1: “The big brown dog was playing with another black dog”
if we apply dropout 0.2, the sentence can come out as either S1’ or S1’’:
S1’: “The big brown - was playing with another black -”
S1’’: “The big - dog was playing with another black -”
Note that in S1’’ one occurrence of “dog” is dropped while the other is kept, so the drops are per token, not per word type.
If you want to zero out word embeddings based on their word type (which means, as you quoted before, that the “the” embedding is masked), I think you should bring back the old Embedding layer from Keras 1 (https://github.com/keras-team/keras/blob/keras-1/keras/layers/embeddings.py), which in my understanding actually drops random word types per step, roughly like the sketch below.
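The gist of that Keras 1 behaviour is to mask whole rows of the embedding matrix during training; a paraphrased sketch (not the exact Keras 1 code, and embed_with_word_type_dropout is just an illustrative name):

```python
from keras import backend as K

def embed_with_word_type_dropout(W, ids, rate):
    """W: (vocab_size, embed_dim) embedding matrix; ids: integer tensor of word ids.
    Drops entire rows of W (whole word types) with probability `rate` during
    training, rescaling the kept rows by 1 / (1 - rate)."""
    if 0.0 < rate < 1.0:
        retain_p = 1.0 - rate
        mask = K.random_binomial((K.int_shape(W)[0],), p=retain_p) / retain_p
        W = K.in_train_phase(W * K.expand_dims(mask), W)
    return K.gather(W, ids)
```

Because the mask is drawn per word type rather than per position, both occurrences of “dog” in S1 above would be kept or dropped together.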
My honest question would be: does it really matter?
The intuition is to mask some words and learn to make correct decisions (predictions) without them, in other words to avoid overfitting to specific words that may occur very frequently in the dataset. But how often do we see a task keyword, something that really matters (e.g. a person’s name, or an indicative verb when we train a NER model), rather than a stop word like the “the” in the quote above, twice or more in a single sentence?