ImageClassifier missing checkpoint leads to saving model error
Bug Description
After fitting a model with ImageClassifier, I can’t export the model because a checkpoint is missing. I checked the generated image_classifier folder and confirmed it: the export tries to load a checkpoint that is not there. So this is not a wrong-path issue.
Bug Error Message
1.0000Found 60 images belonging to 2 classes.
Found 60 images belonging to 2 classes.
30/30 [==============================] - 31s 1s/step - loss: 1.5799e-05 - accuracy: 1.0000 - val_loss: 0.7994 - val_accuracy: 0.8000
Traceback (most recent call last):
File "covid19.py", line 77, in <module>
model = clf.export_model()
File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/autokeras/auto_model.py", line 454, in export_model
return self.tuner.get_best_model()
File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/autokeras/engine/tuner.py", line 50, in get_best_model
model = super().get_best_models()[0]
File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/kerastuner/engine/tuner.py", line 258, in get_best_models
return super(Tuner, self).get_best_models(num_models)
File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/kerastuner/engine/base_tuner.py", line 240, in get_best_models
models = [self.load_model(trial) for trial in best_trials]
File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/kerastuner/engine/base_tuner.py", line 240, in <listcomp>
models = [self.load_model(trial) for trial in best_trials]
File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/kerastuner/engine/tuner.py", line 183, in load_model
model.load_weights(self._get_checkpoint_fname(
File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 250, in load_weights
return super(Model, self).load_weights(filepath, by_name, skip_mismatch)
File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/tensorflow/python/keras/engine/network.py", line 1231, in load_weights
py_checkpoint_reader.NewCheckpointReader(filepath)
File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 95, in NewCheckpointReader
return CheckpointReader(compat.as_bytes(filepattern))
ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./image_classifier/trial_9d509ff4f26fbf458ebe2930ed9aa1fd/checkpoints/epoch_4/checkpoint: Not found: ./image_classifier/trial_9d509ff4f26fbf458ebe2930ed9aa1fd/checkpoints/epoch_4; No such file or directory
And as you can see:
(ak4) vicastro@dgx1:/raid/vicastro$ ls ./image_classifier/trial_9d509ff4f26fbf458ebe2930ed9aa1fd/checkpoints/
epoch_0 epoch_11 epoch_13 epoch_5 epoch_7 epoch_9
epoch_10 epoch_12 epoch_14 epoch_6 epoch_8
It looks as if it is designed to keep only the last 10 epochs or something like that, and then tries to load one epoch more than it kept: epochs 4 to 14, both inclusive, make a total of 11 epochs.
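To check this kind of mismatch without reading the directory listing by eye, a small helper can compare the epochs that exist on disk with the range the tuner is expected to need. This is only a sketch: the function names `list_checkpoint_epochs` and `missing_epochs` are made up for illustration, and it assumes the `epoch_<n>` naming seen in the listing above.

```python
import re
from pathlib import Path


def list_checkpoint_epochs(checkpoint_dir):
    """Return the sorted epoch numbers that have an epoch_<n> entry on disk."""
    epochs = []
    for entry in Path(checkpoint_dir).iterdir():
        # Assumes the epoch_<n> naming shown in the ls output of this report.
        match = re.fullmatch(r"epoch_(\d+)", entry.name)
        if match:
            epochs.append(int(match.group(1)))
    return sorted(epochs)


def missing_epochs(checkpoint_dir, first, last):
    """Return the epochs in [first, last] that have no checkpoint entry."""
    present = set(list_checkpoint_epochs(checkpoint_dir))
    return [epoch for epoch in range(first, last + 1) if epoch not in present]
```

For the listing above, `missing_epochs("./image_classifier/trial_9d509ff4f26fbf458ebe2930ed9aa1fd/checkpoints", 4, 14)` would report epoch 4 as the only missing one in that range, which matches the path in the ValueError.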
Bug Reproduction
# Data is produced by generators built on ImageDataGenerator; fitting works perfectly.
import autokeras as ak
import tensorflow as tf

# get_train_generator and get_val_generator are defined elsewhere (not shown).
train_dataset = tf.data.Dataset.from_generator(
    get_train_generator,
    output_types=('uint8', 'uint8'),
    output_shapes=(tf.TensorShape((None, 256, 256, 1)), tf.TensorShape((None, 2))))
val_dataset = tf.data.Dataset.from_generator(
    get_val_generator,
    output_types=('uint8', 'uint8'),
    output_shapes=(tf.TensorShape((None, 256, 256, 1)), tf.TensorShape((None, 2))))
clf = ak.ImageClassifier(max_trials=2)
clf.fit(train_dataset, validation_data=val_dataset, epochs=10)
model = clf.export_model()  # raises the ValueError shown above
model.save("./model.h5")
Data used by the code: Greyscale images, can’t provide the dataset since it is not public yet.
Expected Behavior
Fit a model, export and save it.
Setup Details
Include the details about the versions of:
- OS type and version:
- Python: 3.8.3
- autokeras: 1.0.4
- keras-tuner: 1.0.2rc1
- scikit-learn: 0.23.1
- numpy: 1.18.15
- pandas: 1.0.5
- tensorflow: 2.2.0
Additional context
It also fails with autokeras 1.0.3 and the previous version of keras-tuner. I am confused because one month ago I was doing this with MNIST and it worked nicely.
Also, if I set epochs=10 it works, but with more epochs I get the error mentioned above.
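The observation that 10 epochs work and more epochs fail is consistent with the guess that only the newest 10 checkpoints are kept. The toy model below illustrates that guess and nothing more. It is not keras-tuner code: the `keep_last=10` window and both function names are assumptions made for this sketch.

```python
def surviving_checkpoints(total_epochs, keep_last):
    """Epochs whose checkpoints remain if only the newest `keep_last` are kept."""
    first_kept = max(0, total_epochs - keep_last)
    return list(range(first_kept, total_epochs))


def can_restore(best_epoch, total_epochs, keep_last=10):
    """True if the best epoch's checkpoint would still be on disk.

    keep_last=10 is a hypothetical retention window, taken from the
    reporter's guess rather than from the library source.
    """
    return best_epoch in surviving_checkpoints(total_epochs, keep_last)
```

Under this assumption, `can_restore(4, 10)` holds because nothing is deleted in a 10-epoch run, while `can_restore(4, 15)` does not, because only epochs 5 to 14 survive a 15-epoch run.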
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:8 (1 by maintainers)
Top GitHub Comments
I have tested this with v1.0.5 and TF 2.3.0. I can confirm that the issue is completely resolved. Also closed #1210
https://github.com/keras-team/autokeras/issues/1210 will be helpful. For now, a simple workaround is to comment out the delete portion.