
How to add Attention on top of a Recurrent Layer (Text Classification)

See original GitHub issue

I am doing text classification, using my own pre-trained word embeddings, with an LSTM layer on top and a softmax at the end.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.regularizers import activity_l2

vocab_size = embeddings.shape[0]      # rows of the pre-trained embedding matrix
embedding_size = embeddings.shape[1]  # embedding dimensionality

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,   # keep the pre-trained embeddings frozen
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

Pretty simple. Now I want to add attention to the model, but I don’t know how to do it.

My understanding is that I have to set return_sequences=True so that the attention layer can weigh each timestep accordingly. That way the LSTM will return a 3D tensor, right? After that, what do I have to do? Is there a way to easily implement a model with attention using existing Keras layers, or do I have to write my own custom layer?

If this can be done with the available Keras Layers, I would really appreciate an example.
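For reference, the shape difference the question is describing can be sketched with a toy, ungated recurrence in plain NumPy (the sizes here are arbitrary placeholders, not the model above):

```python
import numpy as np

batch, timesteps, features, units = 2, 5, 3, 4
x = np.random.rand(batch, timesteps, features)   # toy input sequence
Wx = np.random.rand(features, units)             # input-to-hidden weights
Wh = np.random.rand(units, units)                # hidden-to-hidden weights

h = np.zeros((batch, units))
all_h = []
for t in range(timesteps):                       # simplified RNN step (no gates)
    h = np.tanh(x[:, t] @ Wx + h @ Wh)
    all_h.append(h)

seq_out = np.stack(all_h, axis=1)  # return_sequences=True  -> 3D: (batch, timesteps, units)
last_out = all_h[-1]               # return_sequences=False -> 2D: (batch, units)
```

An attention layer needs the full 3D `seq_out` so it has one hidden state per timestep to score.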

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 42
  • Comments: 116 (20 by maintainers)

Top GitHub Comments

99 reactions
patyork commented, Jan 7, 2017

@baziotis This area is supposed to be more for bugs as opposed to “how to implement” questions. I admit I don’t often look at the google group, but that is a valid place to ask these questions, as well as on the Slack channel.

Bengio et al. have a pretty good paper on attention (soft attention is the softmax-based attention).

An example of method a) I described:

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False))
model.add(Activation('softmax')) #this guy here
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

example b), with simple activation:

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False, activation='softmax'))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

example b) with tanh and then softmax (non-working, but the idea):

from keras import backend as K

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

def myAct(out):
    return K.softmax(K.tanh(out))

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False, activation=myAct))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

In addition, I should say that my notes about whether a) or b) above is what you probably need are based on your example, where you want one output (making option b probably the correct way). Attention is often used in spaces like caption generation where there is more than 1 output such as setting return_sequences=True. For those cases, I think that option a) is the described usage, such that the recurrency keeps all the information passing forward, and it’s just the higher layers that utilize the attention.
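For comparison, the "weigh each timestep" idea from the original question can be sketched outside of Keras in plain NumPy. All the weight matrices here are random stand-ins for learned parameters, and `h_seq` stands in for the LSTM's per-timestep outputs:

```python
import numpy as np

def soft_attention(h_seq, W, v):
    """Score every timestep, softmax the scores, and return the
    attention-weighted sum of the hidden states (the context vector)."""
    scores = np.tanh(h_seq @ W) @ v    # one scalar score per timestep
    e = np.exp(scores - scores.max())
    weights = e / e.sum()              # softmax over timesteps
    context = weights @ h_seq          # weighted sum: shape (units,)
    return context, weights

timesteps, units = 6, 8
h_seq = np.random.rand(timesteps, units)  # stand-in for LSTM outputs (3D minus batch)
W = np.random.rand(units, units)          # placeholder scoring weights
v = np.random.rand(units)                 # placeholder scoring vector

context, weights = soft_attention(h_seq, W, v)
```

The `context` vector would then feed the final Dense softmax classifier in place of the LSTM's last output.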

75 reactions
mbollmann commented, Jan 10, 2017

@patyork, I’m sorry, but I don’t see how this implements attention at all?

From my understanding, the softmax in the Bengio et al. paper is not applied over the LSTM output, but over the output of an attention model, which is calculated from the LSTM’s hidden state at a given timestep. The output of the softmax is then used to modify the LSTM’s internal state. Essentially, attention is something that happens within an LSTM since it is both based on and modifies its internal states.

I actually made my own attempt to create an attentional LSTM in Keras, based on the very same paper you cited, which I’ve shared here:

https://gist.github.com/mbollmann/ccc735366221e4dba9f89d2aab86da1e

There are several different ways to incorporate attention into an LSTM, and I won’t claim 100% correctness of my implementation (though I’d appreciate any hints if something seems terribly wrong!), but I’d be surprised if it was as simple as adding a softmax activation.
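The distinction drawn above — attention computed from the hidden state at each step and fed back into the recurrence — can be sketched roughly as follows. This is a toy, ungated recurrence with random placeholder weights, not the gist's actual implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

timesteps, units = 5, 4
annotations = np.random.rand(timesteps, units)  # sequence being attended over
Wa = np.random.rand(units, units)               # placeholder attention scoring weights
Wh = np.random.rand(units, units)               # placeholder hidden-to-hidden weights
Wc = np.random.rand(units, units)               # placeholder context-to-hidden weights

h = np.zeros(units)
for _ in range(timesteps):
    # 1) score the annotations against the *current* hidden state
    alpha = softmax(annotations @ Wa @ h)
    # 2) build a context vector from the attention weights
    context = alpha @ annotations
    # 3) the context modifies the next internal state of the recurrence
    h = np.tanh(h @ Wh + context @ Wc)
```

The key difference from a post-hoc softmax: the attention weights `alpha` are recomputed inside every step and influence the state update itself, rather than being applied once to the finished outputs.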


