Dropout in embedding layer
In this paper, the authors state that applying dropout to the input of an embedding layer by selectively dropping certain ids is an effective way to prevent overfitting. For example, if the embedding is a word2vec embedding, this kind of dropout might drop the word “the” from the entire input sequence. In that case, the input “the dog and the cat” would become “-- dog and -- cat”; it would never become “-- dog and the cat”. This prevents the model from depending too heavily on specific words.
Although Keras currently allows applying dropout to the output vectors of an embedding layer, as far as I can tell from the documentation it does not allow applying dropout selectively to certain ids. Since embeddings are widely used, and the paper above states that they are prone to overfitting, this seems like a feature that would be useful to a relatively wide range of users. The expected API would be something like:
```python
from keras.layers import Embedding

embedding = Embedding(x, y, dropout=0.2)
```
where the dropout rate is the fraction of ids to drop. Would this be a worthwhile feature to add? Or is there a relatively obvious way to implement this functionality already?
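For illustration, a rough sketch of what such id-level dropout could look like with existing building blocks, assuming Keras 2 with the TensorFlow backend; `drop_word_ids`, the vocabulary size, and the rate are made-up names and values, not an existing Keras API:

```python
from keras import backend as K
from keras.layers import Lambda

def drop_word_ids(ids, vocab_size, rate):
    """During training, maps a random `rate` fraction of word *types* to id 0
    (e.g. the padding/UNK id), so every occurrence of a dropped word
    disappears from the whole batch."""
    keep = K.random_binomial((vocab_size,), p=1.0 - rate)   # one keep/drop decision per word type
    dropped = K.cast(K.cast(ids, 'float32') * K.gather(keep, ids), 'int32')
    return K.in_train_phase(dropped, ids)

# Applied in front of the Embedding layer:
word_dropout = Lambda(lambda t: drop_word_ids(t, vocab_size=20000, rate=0.2))
```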
I was also trying to find a solution for (word) embedding dropout.
The Dropout specification says: "noise_shape: 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input. For instance, if your inputs have shape (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features)." See also the SpatialDropout1D implementation here (https://github.com/keras-team/keras/blob/master/keras/layers/core.py), which actually uses the mask mentioned above.
So SpatialDropout1D performs variational dropout, at least for NLP-style models. We tested both Dropout(noise_shape=(batch_size, 1, features)) and SpatialDropout1D(), and as we expected, they both apply variational dropout (https://arxiv.org/pdf/1512.05287.pdf). So, if you need to drop a full word type (embedding), you have to use noise_shape=(batch_size, sequence_size, 1) in the Dropout layer, or you can create a new layer based on the SpatialDropout1D paradigm, like the sketch below. Please let me know if you find this helpful and, most importantly, correct in terms of the expected behaviour. @keitakurita @riadsouissi
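A minimal sketch of such a layer, assuming Keras 2.x, where SpatialDropout1D itself is implemented by overriding Dropout._get_noise_shape; the name WordDropout1D is just illustrative:

```python
from keras import backend as K
from keras.layers import Dropout

class WordDropout1D(Dropout):
    """Drops entire timesteps (whole word vectors) instead of single
    embedding dimensions, by broadcasting the mask over the feature axis."""

    def _get_noise_shape(self, inputs):
        input_shape = K.shape(inputs)                 # (batch_size, timesteps, features)
        return (input_shape[0], input_shape[1], 1)
```

This behaves like Dropout(rate, noise_shape=(batch_size, sequence_size, 1)), but infers the shape at call time instead of requiring it up front.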
@JohnGiorgi This code does not zero out word types per sequence as the authors claim, which means that for:
S1: “The big brown dog was playing with another black dog”
if we apply dropout 0.2, the sentence can come out as either S1’ or S1’’:
S1’: “The big brown - was playing with another black -”
S1’’: “The big - dog was playing with another black -”
Note that in S1’’ one occurrence of “dog” is dropped while the other is kept, so the drops are per token, not per word type.
If you want to zero out word embeddings based on their word type (which means, as you quoted before, that the “the” embedding is masked), I think you should bring back the old Embedding layer from Keras 1 (https://github.com/keras-team/keras/blob/keras-1/keras/layers/embeddings.py), which in my understanding actually drops random word types per step, roughly like the sketch below.
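The gist of that Keras 1 behaviour is to mask whole rows of the embedding matrix during training; a paraphrased sketch (not the exact Keras 1 code, and embed_with_word_type_dropout is just an illustrative name):

```python
from keras import backend as K

def embed_with_word_type_dropout(W, ids, rate):
    """W: (vocab_size, embed_dim) embedding matrix; ids: integer tensor of word ids.
    Drops entire rows of W (whole word types) with probability `rate` during
    training, rescaling the kept rows by 1 / (1 - rate)."""
    if 0.0 < rate < 1.0:
        retain_p = 1.0 - rate
        mask = K.random_binomial((K.int_shape(W)[0],), p=retain_p) / retain_p
        W = K.in_train_phase(W * K.expand_dims(mask), W)
    return K.gather(W, ids)
```

Because the mask is drawn per word type rather than per position, both occurrences of “dog” in S1 above would be kept or dropped together.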
My honest question would be: does it really matter?
The intuition is to mask some words and learn to make correct decisions (predictions) without them, in other words to avoid overfitting to specific words that may occur very frequently in the dataset. But how often do we see a task keyword, something that really matters (e.g. a person’s name, or an indicative verb when we train a NER model), rather than a stop word like the “the” in the quote above, twice or more in a single sentence?