
How to add Attention on top of a Recurrent Layer (Text Classification)

See original GitHub issue

I am doing text classification, using my own pre-trained word embeddings, with an LSTM layer on top and a softmax at the end.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.regularizers import activity_l2

vocab_size = embeddings.shape[0]      # rows of the pre-trained embedding matrix
embedding_size = embeddings.shape[1]  # embedding dimensionality

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,   # keep the pre-trained embeddings frozen
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

Pretty simple. Now I want to add attention to the model, but I don’t know how to do it.

My understanding is that I have to set return_sequences=True so that the attention layer can weigh each timestep accordingly. That way the LSTM will return a 3D tensor, right? After that, what do I have to do? Is there a way to easily implement a model with attention using existing Keras layers, or do I have to write my own custom layer?

If this can be done with the available Keras Layers, I would really appreciate an example.
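For reference, the shape difference the question is describing can be sketched with a toy, ungated recurrence in plain NumPy (the sizes here are arbitrary placeholders, not the model above):

```python
import numpy as np

batch, timesteps, features, units = 2, 5, 3, 4
x = np.random.rand(batch, timesteps, features)   # toy input sequence
Wx = np.random.rand(features, units)             # input-to-hidden weights
Wh = np.random.rand(units, units)                # hidden-to-hidden weights

h = np.zeros((batch, units))
all_h = []
for t in range(timesteps):                       # simplified RNN step (no gates)
    h = np.tanh(x[:, t] @ Wx + h @ Wh)
    all_h.append(h)

seq_out = np.stack(all_h, axis=1)  # return_sequences=True  -> 3D: (batch, timesteps, units)
last_out = all_h[-1]               # return_sequences=False -> 2D: (batch, units)
```

An attention layer needs the full 3D `seq_out` so it has one hidden state per timestep to score.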

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 42
  • Comments: 116 (20 by maintainers)

Top GitHub Comments

99 reactions
patyork commented, Jan 7, 2017

@baziotis This area is supposed to be more for bugs as opposed to “how to implement” questions. I admit I don’t often look at the google group, but that is a valid place to ask these questions, as well as on the Slack channel.

Bengio et al. have a pretty good paper on attention (soft attention is the softmax-based attention).

An example of method a) I described:

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False))
model.add(Activation('softmax')) #this guy here
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

example b), with simple activation:

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False, activation='softmax'))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

example b) with tanh and then softmax (non-working, but the idea):

from keras import backend as K

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

def myAct(out):
    return K.softmax(K.tanh(out))

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False, activation=myAct))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

In addition, I should say that my notes about whether a) or b) above is what you probably need are based on your example, where you want one output (making option b probably the correct way). Attention is often used in spaces like caption generation where there is more than 1 output such as setting return_sequences=True. For those cases, I think that option a) is the described usage, such that the recurrency keeps all the information passing forward, and it’s just the higher layers that utilize the attention.
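For comparison, the "weigh each timestep" idea from the original question can be sketched outside of Keras in plain NumPy. All the weight matrices here are random stand-ins for learned parameters, and `h_seq` stands in for the LSTM's per-timestep outputs:

```python
import numpy as np

def soft_attention(h_seq, W, v):
    """Score every timestep, softmax the scores, and return the
    attention-weighted sum of the hidden states (the context vector)."""
    scores = np.tanh(h_seq @ W) @ v    # one scalar score per timestep
    e = np.exp(scores - scores.max())
    weights = e / e.sum()              # softmax over timesteps
    context = weights @ h_seq          # weighted sum: shape (units,)
    return context, weights

timesteps, units = 6, 8
h_seq = np.random.rand(timesteps, units)  # stand-in for LSTM outputs (3D minus batch)
W = np.random.rand(units, units)          # placeholder scoring weights
v = np.random.rand(units)                 # placeholder scoring vector

context, weights = soft_attention(h_seq, W, v)
```

The `context` vector would then feed the final Dense softmax classifier in place of the LSTM's last output.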

75 reactions
mbollmann commented, Jan 10, 2017

@patyork, I’m sorry, but I don’t see how this implements attention at all?

From my understanding, the softmax in the Bengio et al. paper is not applied over the LSTM output, but over the output of an attention model, which is calculated from the LSTM’s hidden state at a given timestep. The output of the softmax is then used to modify the LSTM’s internal state. Essentially, attention is something that happens within an LSTM since it is both based on and modifies its internal states.

I actually made my own attempt to create an attentional LSTM in Keras, based on the very same paper you cited, which I’ve shared here:

https://gist.github.com/mbollmann/ccc735366221e4dba9f89d2aab86da1e

There are several different ways to incorporate attention into an LSTM, and I won’t claim 100% correctness of my implementation (though I’d appreciate any hints if something seems terribly wrong!), but I’d be surprised if it was as simple as adding a softmax activation.
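The distinction drawn above — attention computed from the hidden state at each step and fed back into the recurrence — can be sketched roughly as follows. This is a toy, ungated recurrence with random placeholder weights, not the gist's actual implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

timesteps, units = 5, 4
annotations = np.random.rand(timesteps, units)  # sequence being attended over
Wa = np.random.rand(units, units)               # placeholder attention scoring weights
Wh = np.random.rand(units, units)               # placeholder hidden-to-hidden weights
Wc = np.random.rand(units, units)               # placeholder context-to-hidden weights

h = np.zeros(units)
for _ in range(timesteps):
    # 1) score the annotations against the *current* hidden state
    alpha = softmax(annotations @ Wa @ h)
    # 2) build a context vector from the attention weights
    context = alpha @ annotations
    # 3) the context modifies the next internal state of the recurrence
    h = np.tanh(h @ Wh + context @ Wc)
```

The key difference from a post-hoc softmax: the attention weights `alpha` are recomputed inside every step and influence the state update itself, rather than being applied once to the finished outputs.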


