
Model stops training with variable-size dataset

See original GitHub issue

System information.

  • Have I written custom code (as opposed to using a stock example script provided in Keras): yes, but a very simple case
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Colab
  • TensorFlow installed from (source or binary): Colab default
  • TensorFlow version (use command below): v2.7.0-0-gc256c071bb2 2.7.0
  • Python version: 3
  • Bazel version (if compiling from source): no
  • GPU model and memory: no
  • Exact command to reproduce: https://colab.research.google.com/drive/1fY4v9WBRxfsywDyKKidu-lmFpaPdAn9D?usp=sharing

Describe the problem.

In my real use case I train the model with a tf.data.Dataset instance (built on tensorflow_datasets). One big difference from the default keras.Model.fit + Dataset examples is that the dataset length is unknown (variable, ± 20%), because I apply random augmentations and filter some samples out. See the provided Colab link for what I mean.
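A minimal sketch of the situation described above (hypothetical, not the reporter's actual pipeline): a random `filter` makes the dataset's length vary between passes, and its cardinality becomes unknown.

```python
import tensorflow as tf

def make_dataset():
    """A dataset whose length fluctuates between epochs because
    elements are dropped at random (~80% survive each pass)."""
    ds = tf.data.Dataset.range(1000)
    ds = ds.filter(lambda i: tf.random.uniform(()) > 0.2)
    return ds.batch(32)

ds = make_dataset()
# The random filter makes the cardinality unknown to tf.data.
print(ds.cardinality().numpy())  # -2, i.e. tf.data.UNKNOWN_CARDINALITY
```

Because the cardinality is unknown, Keras infers the epoch length from the first pass and assumes every later pass is at least as long.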

As a result, when the first epoch finishes (the dataset raises OutOfRangeError), Keras remembers the step count, and if the same dataset is shorter on a later epoch, all model training is stopped.

Describe the current behavior. The model stops training if the second/third/etc. dataset iterator is shorter than the first one.

Describe the expected behavior. The model should not stop training. It may print a warning, but should continue.

  • Do you want to contribute a PR? (yes/no): no

Standalone code to reproduce the issue. https://colab.research.google.com/drive/1fY4v9WBRxfsywDyKKidu-lmFpaPdAn9D?usp=sharing

Source code / logs.

model.fit(dataset, epochs=15)

# Epoch 1/15
# 819/819 [==============================] - 2s 1ms/step - loss: 1.3987
# Epoch 2/15
# 819/819 [==============================] - 1s 1ms/step - loss: 1.0563
# Epoch 3/15
# 819/819 [==============================] - 1s 1ms/step - loss: 1.0262
# Epoch 4/15
# 819/819 [==============================] - 1s 1ms/step - loss: 1.0156
# Epoch 5/15
# 782/819 [===========================>..] - ETA: 0s - loss: 1.0146WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 12285 batches). You may need to use the repeat() function when building your dataset.
# 819/819 [==============================] - 1s 1ms/step - loss: 1.0161
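The warning in the log above suggests the usual workaround. A hedged sketch of it (not the fix this issue asks for): `repeat()` makes the dataset infinite so it never raises OutOfRangeError, and an explicit, conservative `steps_per_epoch` pins the epoch length. The dataset and model here are illustrative stand-ins.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(1000)
ds = ds.filter(lambda i: tf.random.uniform(()) > 0.2)  # variable length

def to_xy(i):
    # Toy regression target: predict the (normalized) input itself.
    x = tf.reshape(tf.cast(i, tf.float32) / 1000.0, [1])
    return x, x

ds = ds.map(to_xy).batch(32).repeat()  # infinite stream, no OutOfRangeError

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# ~800 elements survive per pass on average, so 20 batches of 32 is a
# safe lower bound for the epoch length.
history = model.fit(ds, epochs=3, steps_per_epoch=20, verbose=0)
```

The drawback, as the reporter notes below, is that a good `steps_per_epoch` estimate is not always cheap to obtain.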

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
haifeng-jin commented, Dec 9, 2021

Waiting for triage. Summary: when the dataset yields a different number of samples from epoch to epoch (the batch size is the same, but the number of steps differs), training stops at the first epoch whose number of steps differs from the first epoch's.

0 reactions
shkarupa-alex commented, Jul 1, 2022

Got the same issue when implementing a word2vec model. The dataset size changes from epoch to epoch due to:

  • randomness in the skipgram/cbow context size
  • randomness in downsampling with a threshold

A single estimation of the number of batches takes around 4 hours (very large dataset), and this size can change by ± 20% from epoch to epoch.

So setting steps_per_epoch is not a good option. It would be great if keras.Model always handled OutOfRangeError itself.
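Until Keras handles this, a custom training loop tolerates variable-length epochs naturally, because each `for` pass simply runs until the iterator is exhausted. A minimal sketch under the same toy setup as above (hypothetical names, not the commenter's word2vec code):

```python
import tensorflow as tf

def make_dataset():
    """Dataset whose per-epoch length varies due to random filtering."""
    ds = tf.data.Dataset.range(256)
    ds = ds.filter(lambda i: tf.random.uniform(()) > 0.2)

    def to_xy(i):
        x = tf.reshape(tf.cast(i, tf.float32) / 256.0, [1])
        return x, x

    return ds.map(to_xy).batch(32)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD()
loss_fn = tf.keras.losses.MeanSquaredError()

for epoch in range(3):
    steps = 0
    # Each pass may yield a different number of batches; a short
    # epoch just ends early instead of killing the whole run.
    for x, y in make_dataset():
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        steps += 1
    print(f"epoch {epoch}: {steps} steps")
```

The trade-off is giving up `fit()` conveniences such as callbacks and the progress bar.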


