Keras Model does not converge

Hi, I’m using Petastorm and trying to train a Keras (using tf) model.

I’ve created a dataset using materialize_dataset, then used make_reader to create a data iterator. I’m trying to train a Keras model with this iterator, though the model doesn’t seem to converge:

def get_data_iterator(dataset_path):
    with make_reader(dataset_path, num_epochs=None, cur_shard=0, shard_count=1) as reader:
        for row in reader:
            ....
            if i == batch_size:
                yield data

train_data = get_data_iterator(train_data_path)
history = model.fit_generator(train_data, steps_per_epoch=160,
                              epochs=20, validation_steps=8)

Epoch 1/20
160/160 [==============================] - 446s 3s/step - loss: 0.4831
Epoch 2/20
160/160 [==============================] - 415s 3s/step - loss: 0.4299
Epoch 3/20
160/160 [==============================] - 465s 3s/step - loss: 0.4509
Epoch 4/20
160/160 [==============================] - 456s 3s/step - loss: 0.4332
Epoch 5/20
160/160 [==============================] - 412s 3s/step - loss: 0.4385 
Epoch 6/20
160/160 [==============================] - 458s 3s/step - loss: 0.4074
Epoch 7/20
160/160 [==============================] - 689s 4s/step - loss: 0.4337
....
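
The batching logic inside `get_data_iterator` is elided in the post, so the following is only a hypothetical sketch of how rows from any iterable could be grouped into fixed-size `(X, y)` batches. The names `batch_rows` and `to_xy` are assumptions made for illustration; they are not part of Petastorm or of the original code. The detail it highlights is that the accumulators are reset after every yield.

```python
import numpy as np


def batch_rows(rows, batch_size, to_xy):
    """Group rows from any iterable into (X, y) batches of batch_size.

    `to_xy` is a hypothetical callable that maps one row to a
    (features, label) pair; the real mapping is not shown in the post.
    """
    xs, ys = [], []
    for row in rows:
        x, y = to_xy(row)
        xs.append(x)
        ys.append(y)
        if len(xs) == batch_size:
            yield np.stack(xs), np.asarray(ys)
            # Start a fresh batch so no sample is emitted twice.
            xs, ys = [], []
    # A trailing partial batch is dropped here on purpose.


# Toy usage with plain tuples standing in for reader rows.
toy_rows = [(np.full(3, i, dtype=np.float32), i % 2) for i in range(7)]
batches = list(batch_rows(toy_rows, batch_size=2, to_xy=lambda r: r))
```

With seven toy rows and a batch size of two, the seventh row is left over and discarded, which is one way to keep every batch the same shape.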

When I read the iterator into memory first and then train the model, it does seem to converge:

def read_generator_to_memory(generator):
    X_train = np.zeros(...)
    y_train = np.zeros(...)
    i = 0
    for sample in generator:
        if i == steps_per_epoch:
            break
        # Copy one batch into its slot in the preallocated arrays.
        X, y = sample
        X_train[i*batch_size:(i+1)*batch_size] = X
        y_train[i*batch_size:(i+1)*batch_size] = y
        i += 1
    return X_train, y_train

train_data = get_data_iterator(train_data_path)
X_train, y_train = read_generator_to_memory(train_data)
history = model.fit(X_train, y_train, batch_size=32, epochs=20)

Epoch 1/20
4800/4800 [==============================] - 120s 25ms/sample - loss: 0.4831
Epoch 2/20
4800/4800 [==============================] - 120s 25ms/sample - loss: 0.3678
Epoch 3/20
4800/4800 [==============================] - 141s 29ms/sample - loss: 0.2921
...

Just to clarify, the entire dataset does not fit into memory; I used only a small part (~5K rows) of it for these two training attempts. For some reason the first training run does not converge but the second does, even though I would expect the results to be the same.
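
One documented difference between the two runs is that `model.fit` on in-memory arrays shuffles the samples before each epoch by default (`shuffle=True`), while `fit_generator` consumes batches in whatever order the generator produces them. Whether this explains the gap here is an assumption, not a confirmed diagnosis. A minimal sketch for testing it is to wrap the in-memory arrays in a generator that reshuffles every epoch and train with `fit_generator` on that. The name `shuffled_batches` is hypothetical.

```python
import numpy as np


def shuffled_batches(X, y, batch_size, seed=0):
    """Yield (X, y) batches forever, reshuffling the samples every epoch."""
    rng = np.random.RandomState(seed)
    n = len(X)
    while True:
        order = rng.permutation(n)
        # Only full batches are emitted; a remainder is skipped each epoch.
        for start in range(0, n - batch_size + 1, batch_size):
            idx = order[start:start + batch_size]
            yield X[idx], y[idx]


# Toy data: 10 samples, so one epoch is 5 batches of 2.
X_demo = np.arange(10, dtype=np.float32).reshape(10, 1)
y_demo = np.arange(10)
gen = shuffled_batches(X_demo, y_demo, batch_size=2)
epoch = [next(gen) for _ in range(5)]
seen = sorted(int(v) for _, yb in epoch for v in yb)
```

If training on `shuffled_batches(X_train, y_train, batch_size)` converges while the Petastorm-backed generator does not, sample order becomes a likely suspect. Running `model.fit(X_train, y_train, batch_size=32, epochs=20, shuffle=False)` is the complementary check.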

Any idea what could be the reason for that? Thanks, Stav

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
selitvin commented, Jun 19, 2019

This is concerning. I am not aware of any thread safety issues at the moment, but will try to reproduce your failure locally. Instead of training the model, I am planning to dump all the data that passes to Keras from some toy dataset and compare the bits Keras receives. I hope that will expose the issue, if there is one.
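
The comparison described in this comment could be sketched as follows: wrap the generator so that every batch is hashed before it is handed to Keras, then compare the recorded digests between two runs. This is not selitvin's actual script; `fingerprint` and `record_batches` are hypothetical helper names, and the toy generator stands in for the Petastorm-backed one.

```python
import hashlib

import numpy as np


def fingerprint(batch):
    """Return a SHA-256 hex digest covering every array in a batch."""
    digest = hashlib.sha256()
    for array in batch:
        array = np.ascontiguousarray(array)
        # Include dtype and shape so reshaped or recast data hashes differently.
        digest.update(str((array.dtype, array.shape)).encode())
        digest.update(array.tobytes())
    return digest.hexdigest()


def record_batches(generator, log):
    """Pass batches through unchanged while appending their digests to log."""
    for batch in generator:
        log.append(fingerprint(batch))
        yield batch


def toy_generator():
    """Stand-in for the real data generator: three distinct (X, y) batches."""
    for i in range(3):
        yield np.full((2, 4), i, dtype=np.float32), np.array([i, i])


log_a, log_b = [], []
for _ in record_batches(toy_generator(), log_a):
    pass
for _ in record_batches(toy_generator(), log_b):
    pass
```

Comparing the digests as multisets (for example with `collections.Counter`) rather than as ordered lists would show whether batches are duplicated or missing regardless of the order in which they arrive.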

0 reactions
stavshem commented, Jun 18, 2019

I’m working with version 0.7.2, and it includes this commit.
