AutoModel fit does not save the best model for export later
Bug Description
After AutoModel fit finishes, trying to export the model fails with a missing-file error. The error message is below:
ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /ak_vanilla/trial_24d08bd5d9cf85fdcf31a67e75367d72/checkpoints/epoch_20/checkpoint: Not found: /ak_vanilla/trial_24d08bd5d9cf85fdcf31a67e75367d72/checkpoints/epoch_20; No such file or directory
Looking into the checkpoints directory, the folder for epoch_0 is present, followed by epoch_21, epoch_22, … epoch_30 (the max epoch). However, epoch_20 is missing. I am not sure why this happens.
screenshot of the directory: https://prnt.sc/t6qacf
Bug Reproduction
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.python.keras.utils.data_utils import Sequence
import autokeras as ak
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train[:100]
y_train = y_train[:100]
print(x_train.shape)  # (100, 28, 28)
print(y_train.shape)  # (100,)
print(y_train[:3])  # [5 0 4]
# Initialize the image regressor.
reg = ak.ImageRegressor(
    overwrite=True,
    max_trials=10)
# Feed the image regressor with training data.
reg.fit(x_train, y_train, epochs=30)
mdl = reg.export_model()
Data used by the code:
Loading default mnist_data (shown in the code)
Expected Behavior
export_model() should export the Keras model. AutoModel.fit() should save all the epochs during training.
Setup Details
- Ubuntu 18.04
- Python 3.6.9
- autokeras==1.0.3
- keras-tuner==1.0.2rc0
- sklearn==0.23.1
- numpy==1.18.4
- pandas==1.0.5
- tensorflow==2.2.0
Issue Analytics
- Created 3 years ago
- Comments:12 (2 by maintainers)
Top GitHub Comments
I don't think you need to fix this by hardcoding the metric for a particular task.
The problem for me was that autokeras could not load the correct model checkpoint for an epoch because it was looking for one that had already been deleted. That is why I got the error after the trials finished, when the final "best model" loop started going through the best trials.
The deletion logic was described earlier and lives in the save_model method. While debugging, I noticed that the epoch number shown in the console differs from the step value, while in trial.json the step value equals the epoch value. Then I noticed that the deleted epoch is the one immediately before the first epoch still present in my checkpoints directory. For example, if the best step for a trial is 9, the best epoch number in the console log is 10 and the checkpoint is saved in the epoch_9 directory. The save_model method then deletes my epoch_9 directory, because it starts deleting from the wrong epoch number. So I changed this line in that method:
epoch_to_delete = epoch - self._save_n_checkpoints
to this:
epoch_to_delete = epoch - self._save_n_checkpoints - 1
and now my best checkpoints are stored correctly. Hope this helps you too.

I examined this bug and have a fix in PR #1229.
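To make the off-by-one described above easier to see, here is a minimal, self-contained simulation. It is only a sketch and not the actual save_model code: the function name surviving_checkpoints, the extra_offset parameter, the guard condition, and the window size of 10 (guessed from the epoch_21 … epoch_30 listing in the report) are all assumptions made for illustration.

```python
def surviving_checkpoints(num_steps, save_n_checkpoints, extra_offset=0):
    """Simulate a sliding-window checkpoint cleanup.

    Hypothetical model of the behaviour described in this thread, not
    library code. Returns the sorted step numbers still "on disk".
    """
    on_disk = set()
    for epoch in range(num_steps):
        # A checkpoint is written to epoch_<epoch> after every step.
        on_disk.add(epoch)
        epoch_to_delete = epoch - save_n_checkpoints - extra_offset
        # Assumed guard: deletion starts only after the window is
        # exceeded, and epoch_0 is never removed in this model.
        if epoch > save_n_checkpoints and epoch_to_delete > 0:
            on_disk.discard(epoch_to_delete)
    return sorted(on_disk)


# 31 steps (0..30) correspond to fit(..., epochs=30) as in the report.
original = surviving_checkpoints(31, 10)
patched = surviving_checkpoints(31, 10, extra_offset=1)

print("original:", original)
print("patched: ", patched)
print("epoch_20 kept (original)?", 20 in original)
print("epoch_20 kept (patched)? ", 20 in patched)
```

Under these assumptions the original formula leaves epoch_0 plus epoch_21 to epoch_30, which matches the directory listing in the report, while the extra - 1 keeps one more checkpoint, so epoch_20 survives.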