Keras Model does not converge

Hi, I’m using Petastorm and trying to train a Keras (using tf) model.

I’ve created a dataset using materialize_dataset, then used make_reader to create a data iterator. I’m trying to train a Keras model with this iterator, though the model doesn’t seem to converge:

from petastorm import make_reader

def get_data_iterator(dataset_path):
    with make_reader(dataset_path, num_epochs=None, cur_shard=0, shard_count=1) as reader:
        for row in reader:
            ....
            if i == batch_size:
                yield data

train_data = get_data_iterator(train_data_path)
history = model.fit_generator(train_data, steps_per_epoch=160,
                              epochs=20, validation_steps=8)

Epoch 1/20
160/160 [==============================] - 446s 3s/step - loss: 0.4831
Epoch 2/20
160/160 [==============================] - 415s 3s/step - loss: 0.4299
Epoch 3/20
160/160 [==============================] - 465s 3s/step - loss: 0.4509
Epoch 4/20
160/160 [==============================] - 456s 3s/step - loss: 0.4332
Epoch 5/20
160/160 [==============================] - 412s 3s/step - loss: 0.4385 
Epoch 6/20
160/160 [==============================] - 458s 3s/step - loss: 0.4074
Epoch 7/20
160/160 [==============================] - 689s 4s/step - loss: 0.4337
....
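
Since the batching logic inside get_data_iterator is elided above, here is a minimal sketch of how rows could be grouped into fixed-size batches, for comparison only. It is not the poster’s code: batch_rows and its stack parameter are hypothetical names, and rows stands in for any iterable of (x, y) pairs, such as fields pulled from each Petastorm row. One detail it makes explicit is that the buffers are reset after every yield, which is worth checking in any hand-written batching loop.

```python
def batch_rows(rows, batch_size, stack=list):
    """Group an iterable of (x, y) pairs into batches of exactly batch_size rows.

    Hypothetical helper for illustration. `stack` converts the buffered
    values of one batch into the container handed to the caller.
    """
    xs, ys = [], []
    for x, y in rows:
        xs.append(x)
        ys.append(y)
        if len(xs) == batch_size:
            yield stack(xs), stack(ys)
            # Start fresh buffers so the next batch does not carry over
            # rows from the batch that was just yielded.
            xs, ys = [], []
    # A trailing partial batch is dropped so every batch has the same size.
```

With NumPy available, passing stack=np.stack would turn each batch into arrays, which is the form a generator feeding Keras usually yields.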

When I read the iterator into memory first and then train the model, it does seem to converge:

import numpy as np

def read_generator_to_memory(generator):
    X_train = np.zeros(...)
    y_train = np.zeros(...)
    i = 0
    for sample in generator:
        if i == steps_per_epoch:
            break
        # Each sample is one (X, y) batch of batch_size rows.
        X, y = sample
        X_train[i*batch_size:(i+1)*batch_size] = X
        y_train[i*batch_size:(i+1)*batch_size] = y
        i += 1
    return X_train, y_train

train_data = get_data_iterator(train_data_path)
X_train, y_train = read_generator_to_memory(train_data)
history = model.fit(X_train, y_train, batch_size=32, epochs=20)

Epoch 1/20
4800/4800 [==============================] - 120s 25ms/sample - loss: 0.4831
Epoch 2/20
4800/4800 [==============================] - 120s 25ms/sample - loss: 0.3678
Epoch 3/20
4800/4800 [==============================] - 141s 29ms/sample - loss: 0.2921
...

To clarify, the entire dataset does not fit into memory. I used only a small part of it (~5K rows) for these two training attempts. For some reason the first run does not converge but the second does, even though I would expect the results to be the same.

Any idea what could be the reason for that? Thanks, Stav

Top GitHub Comments

selitvin commented, Jun 19, 2019

This is concerning. I am not aware of any thread-safety issues at the moment, but I will try to reproduce your failure locally. Instead of training the model, I plan to dump all the data that gets passed to Keras from a toy dataset and compare the bits Keras receives. Hopefully that will expose the issue, if there is one.
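
The comparison described above could be sketched roughly as follows. This is an assumption about one way to do it, not the maintainer’s actual script: batch_digest and first_mismatch are hypothetical helpers, and each batch is assumed to be an (X, y) pair of objects exposing tobytes(), as NumPy arrays do. Only the raw bytes are hashed, so shape and dtype differences that leave the bytes unchanged would go unnoticed.

```python
import hashlib


def batch_digest(x, y):
    """Return a SHA-256 hex digest over the raw bytes of one (X, y) batch."""
    h = hashlib.sha256()
    h.update(x.tobytes())
    h.update(y.tobytes())
    return h.hexdigest()


def first_mismatch(batches_a, batches_b):
    """Return the index of the first batch whose bytes differ, else None.

    Both arguments are iterables of (X, y) pairs. Comparison stops at the
    end of the shorter iterable, so a length difference is not reported.
    """
    for i, ((xa, ya), (xb, yb)) in enumerate(zip(batches_a, batches_b)):
        if batch_digest(xa, ya) != batch_digest(xb, yb):
            return i
    return None
```

For instance, one side could be the batches yielded by the generator and the other the same number of slices taken from the in-memory arrays. A result of None over the compared range would suggest both training paths see identical data in the same order.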

stavshem commented, Jun 18, 2019

I’m working with version 0.7.2, and it includes this commit.