AutoModel fit does not save the best model for export later

Bug Description

After AutoModel fit finishes, exporting the model fails with a missing-file error. The error message is:

ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /ak_vanilla/trial_24d08bd5d9cf85fdcf31a67e75367d72/checkpoints/epoch_20/checkpoint: Not found: /ak_vanilla/trial_24d08bd5d9cf85fdcf31a67e75367d72/checkpoints/epoch_20; No such file or directory

The checkpoints directory contains epoch_0 followed by epoch_21, epoch_22, … epoch_30 (the max epoch), but epoch_20 is missing. I am not sure why this happens.

screenshot of the directory: https://prnt.sc/t6qacf

Bug Reproduction

import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.python.keras.utils.data_utils import Sequence
import autokeras as ak
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train[:100]
y_train = y_train[:100]
print(x_train.shape)  # (100, 28, 28) after slicing to the first 100 samples
print(y_train.shape)  # (100,)
print(y_train[:3])  # [5 0 4]

# Initialize the image regressor.
reg = ak.ImageRegressor(
    overwrite=True,
    max_trials=10)
# Feed the image regressor with training data.
reg.fit(x_train, y_train, epochs=30)

mdl = reg.export_model()

Data used by the code:

Loading default mnist_data (shown in the code)

Expected Behavior

export_model() should export the Keras model. AutoModel.fit() should save all the epochs during training.

Setup Details

Ubuntu 18.04
Python 3.6.9
autokeras==1.0.3
keras-tuner==1.0.2rc0
sklearn==0.23.1
numpy==1.18.4
pandas==1.0.5
tensorflow==2.2.0

Issue Analytics

State:
Created 3 years ago
Comments: 12 (2 by maintainers)

Top GitHub Comments

3 reactions

I-Kryachko commented, Jul 7, 2020

I think that you do not need to fix this by hardcoding the metric that you need in a certain task.

The problem for me was that autokeras could not load the correct model checkpoint for an epoch because it was looking for one that had been deleted. That is why I got the error after the trials finished, when the final “best model” loop started to go over the best trials.

The deletion algorithm was described earlier and lives in the save_model method. While debugging, I noticed that the epoch value differs from the step value in the console, and that in trial.json the step value equals the epoch_value. Then I noticed that the deleted epoch is the one just before the first epoch saved in my checkpoints directory. For example, if the best step for a trial is 9, the best epoch number in the console log is 10 and the checkpoint is saved in the epoch_9 directory. The save_model method then deletes my epoch_9 directory because it starts deleting from the wrong epoch number.

That is why I changed this line in the method above:

epoch_to_delete = epoch - self._save_n_checkpoints

to this:

epoch_to_delete = epoch - self._save_n_checkpoints - 1

Now my best checkpoints are stored correctly. Hope this helps you too.
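
To illustrate how a one-off mismatch between the index the pruner protects and the index the best checkpoint was saved under can remove the best checkpoint, here is a small standalone simulation. It is a simplified model and not the keras-tuner or AutoKeras source: the function name simulate_pruning, the window of 10 kept checkpoints, and the 31 epoch indices are all assumptions, chosen only to resemble the directory listing in the report.

```python
def simulate_pruning(num_epochs, keep_last, protected_epoch):
    """Return the checkpoint indices left on disk after a simulated run.

    Hypothetical model: one checkpoint per epoch index, a sliding window
    of the last `keep_last` checkpoints, and one index (`protected_epoch`)
    that the pruner believes is the best and refuses to delete.
    """
    kept = set()
    for epoch in range(num_epochs):
        kept.add(epoch)  # "save" the checkpoint for this epoch
        epoch_to_delete = epoch - keep_last
        # Pruning only starts once the window is exceeded, so index 0
        # is never a deletion candidate in this model.
        if epoch > keep_last and epoch_to_delete != protected_epoch:
            kept.discard(epoch_to_delete)
    return sorted(kept)


true_best = 20  # assumed index of the directory holding the best weights

# Pruner protects the same index the best checkpoint was saved under.
consistent = simulate_pruning(num_epochs=31, keep_last=10, protected_epoch=true_best)

# Pruner protects an index that is off by one, so the real best is pruned.
mismatched = simulate_pruning(num_epochs=31, keep_last=10, protected_epoch=true_best + 1)

print(true_best in consistent)   # True
print(true_best in mismatched)   # False
```

In the mismatched run the surviving indices are 0 and 21 through 30, which has the same shape as the listing in the report. Whether the real code path behaves exactly this way is not confirmed by this thread.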

2 reactions

haifeng-jin commented, Jul 12, 2020

I examined this bug and have the fix in PR #1229.