Help with LSTM
I am new to Keras and I am trying to create a few toy examples to get to know it better. I was trying to implement an LSTM that takes two binary strings (b1 and b2) as input and returns the result of b1 OR b2. I know this is not what you would generally use an LSTM for, but I'd still like to try it to get familiar with the different LSTM architectures.
I've created a sequence-to-sequence model. Calling the binary strings a and b, and working with strings of 5 bits, we have the following architecture:
[y_0]      [y_1]      [y_2]      [y_i]      [y_n]
  |          |          |          |          |
  |          |          |          |          |
[h_0]----->[h_1]----->[h_2]----->[h_i]----->[h_n]
  |          |          |          |          |
  |          |          |          |          |
[a_0,b_0]  [a_1,b_1]  [a_2,b_2]  [a_i,b_i]  [a_n,b_n]
Where [a_i, b_i] corresponds to the ith bit of strings a and b. This way:
X_train = (100000, 5, 2) # [samples, time steps, features]
y_train = (100000, 5, 1)
X_test = (30000, 5, 2)
y_test = (30000, 5, 1)
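In case it helps, data like this can be generated with numpy along these lines (a rough sketch; I'm assuming the bits are drawn uniformly at random, and the function name is just illustrative):

import numpy as np

def make_or_data(n_samples, n_bits=5):
    # Two random binary strings per sample: [samples, time steps, features]
    X = np.random.randint(0, 2, size=(n_samples, n_bits, 2))
    # Elementwise OR of the two input bits at each time step
    y = np.bitwise_or(X[:, :, 0], X[:, :, 1])[:, :, None]
    return X.astype('float32'), y.astype('float32')

X_train, y_train = make_or_data(100000)
X_test, y_test = make_or_data(30000)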
I am creating the LSTM and fitting it with:
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

model = Sequential()
# One LSTM layer over the 5 time steps, 2 features per step; return the
# hidden state at every step so we can predict one output bit per step
model.add(LSTM(5, input_dim=2, input_length=5, return_sequences=True))
# Apply the same Dense layer independently at each time step
model.add(TimeDistributed(Dense(1)))
model.compile(loss='mse', optimizer='rmsprop', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=100, nb_epoch=10,
          validation_data=(X_test, y_test))
However, this doesn't converge; this is the output:
Loading data...
100000 train sequences
30000 test sequences
Data Shapes:
X_train: (100000, 5, 2)
y_train: (100000, 5, 1)
X_test: (30000, 5, 2)
y_test: (30000, 5, 1)
Build model...
Train...
Train on 100000 samples, validate on 30000 samples
Epoch 1/10
100000/100000 [==============================] - 8s - loss: 0.3251 - acc: 0.5748 - val_loss: 0.2094 - val_acc: 0.6819
Epoch 2/10
100000/100000 [==============================] - 9s - loss: 0.1963 - acc: 0.6852 - val_loss: 0.1860 - val_acc: 0.7068
Epoch 3/10
100000/100000 [==============================] - 9s - loss: 0.1772 - acc: 0.7252 - val_loss: 0.1706 - val_acc: 0.7396
Epoch 4/10
100000/100000 [==============================] - 8s - loss: 0.1686 - acc: 0.7394 - val_loss: 0.1665 - val_acc: 0.7453
Epoch 5/10
100000/100000 [==============================] - 8s - loss: 0.1654 - acc: 0.7425 - val_loss: 0.1639 - val_acc: 0.7455
Epoch 6/10
100000/100000 [==============================] - 9s - loss: 0.1634 - acc: 0.7428 - val_loss: 0.1620 - val_acc: 0.7457
Epoch 7/10
100000/100000 [==============================] - 8s - loss: 0.1615 - acc: 0.7471 - val_loss: 0.1601 - val_acc: 0.7511
Epoch 8/10
100000/100000 [==============================] - 9s - loss: 0.1594 - acc: 0.7545 - val_loss: 0.1581 - val_acc: 0.7557
Epoch 9/10
100000/100000 [==============================] - 8s - loss: 0.1576 - acc: 0.7553 - val_loss: 0.1566 - val_acc: 0.7573
Epoch 10/10
100000/100000 [==============================] - 8s - loss: 0.1564 - acc: 0.7554 - val_loss: 0.1559 - val_acc: 0.7585
29900/30000 [============================>.] - ETA: 0s
Test score: 0.155859013249
Test accuracy: 0.758480000496
Any clues on what I am doing wrong? I’ve tried tweaking the hyperparams without much success. Should I be working with another LSTM architecture?
Binary operations such as OR, AND, XOR, etc. are not good examples for RNNs, since there is no sequence / time dependency. Take a look at this page:
http://www.xcprod.com/titan/XCSB-DOC/binary_or.html
You can see that each result bit depends only on the two bits being OR'd, and not on any surrounding bits. An RNN should be able to learn it, but it makes a better example for static NNs; that's why XOR is the usual 'hello world' example.
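To make that concrete, here is a minimal sketch (my own, not from the linked page) of a static network learning OR one bit position at a time; since each output bit depends only on the two input bits at the same position, the four possible input combinations are the whole problem:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# The four possible (a_i, b_i) combinations and their OR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype='float32')
y = np.array([[0], [1], [1], [1]], dtype='float32')

model = Sequential()
model.add(Dense(4, input_dim=2, activation='tanh'))  # tiny hidden layer
model.add(Dense(1, activation='sigmoid'))            # one output bit
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(X, y, nb_epoch=500, verbose=0)             # four samples, many epochs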
Binary addition is a better exercise, since it involves a carry bit that depends on the previous step in the sequence. This one is in numpy, not Keras, but see this page and scroll down to 'Our Toy Code':
https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/
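If you want to reuse your Keras setup from above, the addition data could be generated along these lines (a rough sketch; the bit width, the helper names, and the choice to feed bits least-significant first so the carry travels in the direction the LSTM reads are my own assumptions):

import numpy as np

def make_add_data(n_samples, n_bits=8):
    # Keep summands small enough that the sum still fits in n_bits
    a = np.random.randint(0, 2 ** (n_bits - 1), size=n_samples)
    b = np.random.randint(0, 2 ** (n_bits - 1), size=n_samples)
    c = a + b
    # Unpack each number into its bits, least-significant bit first
    to_bits = lambda v: (v[:, None] >> np.arange(n_bits)) & 1
    X = np.stack([to_bits(a), to_bits(b)], axis=-1)  # [samples, steps, 2]
    y = to_bits(c)[:, :, None]                       # [samples, steps, 1]
    return X.astype('float32'), y.astype('float32')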
On a separate note, are you generating the dataset at random? When n = 5, there are only 2^5 = 32 possible 5-bit strings, so the number of distinct OR operations on two 5-bit numbers is 32^2 = 1024; that's the maximum size of your dataset. For low n, you could generate the entire set and then split it into train / test / validation. By the time you get to n = 9 or so, it will probably be worth going back to random sampling.
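For example, the full set of 1024 OR operations for n = 5 can be enumerated directly (a sketch with itertools; the variable names are illustrative):

import itertools
import numpy as np

n_bits = 5
# All 2^5 = 32 possible 5-bit strings
strings = list(itertools.product([0, 1], repeat=n_bits))
# All 32^2 = 1024 ordered pairs of strings
pairs = list(itertools.product(strings, strings))

X = np.array([np.stack([a, b], axis=-1) for a, b in pairs])           # (1024, 5, 2)
y = np.array([[[ai | bi] for ai, bi in zip(a, b)] for a, b in pairs]) # (1024, 5, 1)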