Keras Model does not converge
Hi, I'm using Petastorm to train a Keras (TensorFlow backend) model.
I've created a dataset using materialize_dataset, then used make_reader to create a data iterator.
Using this iterator, I'm trying to train a Keras model, though the model doesn't seem to converge:
def get_data_iterator(dataset_path):
    with make_reader(dataset_path, num_epochs=None, cur_shard=0, shard_count=1) as reader:
        for row in reader:
            ....
            if i == batch_size:
                yield data
train_data = get_data_iterator(train_data_path)
history = model.fit_generator(train_data, steps_per_epoch=160,
                              epochs=20, validation_steps=8)
Epoch 1/20
160/160 [==============================] - 446s 3s/step - loss: 0.4831
Epoch 2/20
160/160 [==============================] - 415s 3s/step - loss: 0.4299
Epoch 3/20
160/160 [==============================] - 465s 3s/step - loss: 0.4509
Epoch 4/20
160/160 [==============================] - 456s 3s/step - loss: 0.4332
Epoch 5/20
160/160 [==============================] - 412s 3s/step - loss: 0.4385
Epoch 6/20
160/160 [==============================] - 458s 3s/step - loss: 0.4074
Epoch 7/20
160/160 [==============================] - 689s 4s/step - loss: 0.4337
....
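For reference, a self-contained version of such a batching generator might look like the sketch below. The field names (features, label) and the argument names are assumptions, since the actual schema is not shown above; the point is only to illustrate turning rows from make_reader into numpy batches that can be fed to fit_generator.
import numpy as np
from petastorm import make_reader

def petastorm_batch_generator(dataset_url, batch_size):
    # Yields (X, y) numpy batches indefinitely from a Petastorm dataset.
    # Assumes each row has 'features' and 'label' fields; adjust to the
    # actual Unischema of the dataset.
    with make_reader(dataset_url, num_epochs=None) as reader:
        xs, ys = [], []
        for row in reader:
            xs.append(row.features)
            ys.append(row.label)
            if len(xs) == batch_size:
                yield np.stack(xs), np.asarray(ys)
                xs, ys = [], []  # reset the buffer for the next batch
A generator like this can be passed straight to fit_generator, as in the snippet above.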
When I read the iterator into memory first and then train the model, it does seem to converge:
def read_generator_to_memory(generator):
    X_train = np.zeros(...)
    y_train = np.zeros(...)
    i = 0
    for row in generator:
        if i == steps_per_epoch:
            break
        X, y = row
        X_train[i*batch_size:(i+1)*batch_size] = X
        y_train[i*batch_size:(i+1)*batch_size] = y
        i += 1
    return X_train, y_train
train_data = get_data_iterator(train_data_path)
X_train, y_train = read_generator_to_memory(train_data)
history = model.fit(X_train, y_train, batch_size=32, epochs=20)
Epoch 1/20
4800/4800 [==============================] - 120s 25ms/sample - loss: 0.4831
Epoch 2/20
4800/4800 [==============================] - 120s 25ms/sample - loss: 0.3678
Epoch 3/20
4800/4800 [==============================] - 141s 29ms/sample - loss: 0.2921
...
Just to clarify: the entire dataset does not fit into memory, so I used only a small part of it (~5K rows) for both training attempts. For some reason the first run does not converge while the second does, even though I would expect the results to be the same.
Any idea what could be the reason for that? Thanks, Stav
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
This is concerning. I am not aware of any thread safety issues at the moment, but I will try to reproduce your failure locally. Instead of training the model, I am planning to dump all the data that is passed to Keras from some toy dataset and compare the bits Keras receives. Hopefully that will expose the issue, if there is one.
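As a rough, non-authoritative sketch of that comparison, assuming the generator yields (X, y) tuples (the original code yields a single "data" object) and that the reader returns rows in the same order on both passes (no shuffling), one could hash each batch and compare it against the corresponding slice of the in-memory arrays:
import hashlib
import numpy as np

def batch_digest(X, y):
    # Hash the raw bytes of a batch so two batches can be compared exactly.
    h = hashlib.sha256()
    h.update(np.ascontiguousarray(X).tobytes())
    h.update(np.ascontiguousarray(y).tobytes())
    return h.hexdigest()

# Compare what the generator produces against the in-memory copy
# (X_train, y_train, batch_size and steps_per_epoch as defined in the issue).
generator = get_data_iterator(train_data_path)
for i, (X, y) in enumerate(generator):
    if i == steps_per_epoch:
        break
    expected = batch_digest(X_train[i*batch_size:(i+1)*batch_size],
                            y_train[i*batch_size:(i+1)*batch_size])
    if batch_digest(X, y) != expected:
        print('Mismatch at batch', i)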
I’m working with version 0.7.2, and it includes this commit.