Reducing overfitting with embedding weights??
- Check that you are up-to-date with the master branch of Keras. You can update with: pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps
- If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with: pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps
Hi, I'm working on a text prediction task using my pretrained word embeddings. The current model is severely overfitting, with the validation loss increasing. I need your advice on how to mitigate this and train the network properly. I'm using the embedding weights in the input layer. The embeddings were trained on a larger corpus; the training corpus for this model is a subset of it. For out-of-vocabulary words I'm using glorot_uniform((270,)) to get random embeddings.
Q1- Is it because the number of network parameters is far greater than the training set size?
Q2- Am I using the embeddings correctly? Is there a problem with my embedding weights?
Q3- What else can I try?
Things I’ve tried:
- Passing the training data as validation data as a bug check; the val loss does decrease (see the sketch after this list).
- Adding regularization to all layers.
- Reducing the LSTM cells from 256 to 128.
- Varying the input sequence length from 5 to 32.
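A minimal sketch of that sanity check (Keras 1.x API; batch_size and nb_epoch here are placeholders, not the values used in the runs below):

# Feed the training data back in as "validation" data; if that loss decreases,
# the training loop itself works and the gap is a generalization problem.
model.fit(X_train, y_train,
          validation_data=(X_train, y_train),
          batch_size=128, nb_epoch=5)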
My model
- Trained a word2vec model to get word embeddings.
- X_train: sequences of word indices (mapped using the word2vec vocabulary plus an OOV dictionary).
- Using X_train to get the embedding weights from the word2vec model, i.e. embed[i] = word2vec[X_train[i]] (see the sketch after this list).
- y_train: one-hot target vectors with a 1 at the position of the next word, using corpus indices (i.e. not the word2vec + OOV dictionary).
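For reference, a rough sketch of how such an embedding matrix could be assembled; word_index (the combined word2vec + OOV index map) and word2vec (the pretrained model) are assumed names, and the uniform draw merely stands in for the glorot_uniform call mentioned above:

import numpy as np

w2v_dim = 270
embed_weight = np.zeros((len(word_index), w2v_dim))   # -> (34444, 270) here
for word, idx in word_index.items():
    if word in word2vec:
        embed_weight[idx] = word2vec[word]             # pretrained vector
    else:
        # random vector for out-of-vocabulary words
        embed_weight[idx] = np.random.uniform(-0.25, 0.25, w2v_dim)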
Corpus size: 46265, Corpus vocab size: 22120
Out of vocab words: 15248
Vocab size of Word2vec model + oov words: 34444
Train corpus size: 46169
Test corpus size: 96
X_train.shape: (46137, 32)
y_train.shape: (46137, 22120)
embed_weight.shape: (34444, 270)
w2v_dim= 270
seq_length= 32
corpus_vocab_size= 22120  # number of unique words in the corpus
memory_units=128
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.regularizers import l2
model = Sequential()
model.add(Embedding(embed_weight.shape[0], embed_weight.shape[1], mask_zero=False, weights=[embed_weight], input_length=seq_length, W_regularizer=l2(l2_emb)))
model.add(LSTM(memory_units, return_sequences=False, init="orthogonal", W_regularizer=l2(l2_lstm)))
model.add(Dropout(0.5))
model.add(Dense(corpus_vocab_size, activation='softmax', init="orthogonal", W_regularizer=l2(l2_dense)))
Compiling Model
l2_emb: 0.7, l2_lstm: 0.7, l2_dense: 0.7, Dropout: 0.5
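The compile call itself isn't shown; based on the edit at the bottom of the post (optimizer=adam, loss='categorical_crossentropy') and the acc metric in the log, it presumably looks roughly like:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])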
Fitting model
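The fit call isn't shown either. Given the 36909/9228 split reported below (roughly 80/20 of the 46137 samples) and the 40 epochs, it was presumably something like the following; batch_size is a placeholder and the learning-rate schedule visible in the log is omitted:

model.fit(X_train, y_train, validation_split=0.2, nb_epoch=40, batch_size=128)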
Train on 36909 samples, validate on 9228 samples
('lr:', array(0.0010000000474974513, dtype=float32))
Epoch 1/40
36909/36909 [==============================] - 135s - loss: 300351.6275 - acc: 0.0223 - val_loss: 9.8617 - val_acc: 0.0237
('lr:', array(0.0010000000474974513, dtype=float32))
Epoch 2/40
36909/36909 [==============================] - 134s - loss: 35422.2196 - acc: 0.0231 - val_loss: 9.9594 - val_acc: 0.0237
('lr:', array(0.0010000000474974513, dtype=float32))
Epoch 3/40
36909/36909 [==============================] - 135s - loss: 3996.1297 - acc: 0.0231 - val_loss: 10.1249 - val_acc: 0.0237
('lr:', array(0.0010000000474974513, dtype=float32))
Epoch 4/40
36909/36909 [==============================] - 134s - loss: 254.4328 - acc: 0.0229 - val_loss: 10.3708 - val_acc: 0.0237
('lr:', array(0.0010000000474974513, dtype=float32))
.
.
.
Epoch 38/40
36909/36909 [==============================] - 134s - loss: 9.0427 - acc: 0.0231 - val_loss: 11.7586 - val_acc: 0.0237
('lr:', array(0.0005000000237487257, dtype=float32))
Epoch 39/40
36909/36909 [==============================] - 134s - loss: 9.0425 - acc: 0.0231 - val_loss: 11.7566 - val_acc: 0.0237
('lr:', array(0.0005000000237487257, dtype=float32))
Epoch 40/40
36909/36909 [==============================] - 134s - loss: 9.0423 - acc: 0.0231 - val_loss: 11.7524 - val_acc: 0.0237
@braingineer @farizrahman4u @carlthome your two cents??
Need advice. Thanks!
Edit: optimizer=adam, loss='categorical_crossentropy'
I must say that in my opinion using a learned Embedding layer always leads to over-fitting, simply because it DRAMATICALLY increases the number of free parameters to learn. Just suppose you have a 500K vocabulary with 100 floats per word: that alone is 50M parameters. In the other case, when you use a pretrained and FIXED word2vec representation, the number of free parameters is just the number of free parameters of your NN, which is often quite small.
@carlthome I could be missing something, but isn't the Embedding layer, as specified in the comment, learned in the model? It seems to me that attaching it directly to randomly initialized layers would be a bad idea until those layers settle down.
On the other hand, if the word2vec transform is happening outside of the model (and is not learnable), then it is not something to discuss.
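For reference, a minimal sketch of the fixed-representation setup discussed in these comments, assuming Keras 1.x where layers accept trainable=False (this is not the original poster's code):

# Freeze the pretrained word2vec weights so the only free parameters
# are the LSTM and the softmax layer on top.
model = Sequential()
model.add(Embedding(embed_weight.shape[0], embed_weight.shape[1], weights=[embed_weight], input_length=seq_length, trainable=False))
model.add(LSTM(memory_units, return_sequences=False))
model.add(Dropout(0.5))
model.add(Dense(corpus_vocab_size, activation='softmax'))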