ImageClassifier missing checkpoint leads to saving model error

Bug Description

After fitting a model with ImageClassifier, I can’t export the model because a checkpoint is missing. I checked the generated image_classifier folder and it’s true: it is trying to load a checkpoint that is not there. So this is not a wrong-path issue.

Bug Error Message

1.0000Found 60 images belonging to 2 classes.
Found 60 images belonging to 2 classes.
30/30 [==============================] - 31s 1s/step - loss: 1.5799e-05 - accuracy: 1.0000 - val_loss: 0.7994 - val_accuracy: 0.8000
Traceback (most recent call last):
  File "covid19.py", line 77, in <module>
    model = clf.export_model()
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/autokeras/auto_model.py", line 454, in export_model
    return self.tuner.get_best_model()
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/autokeras/engine/tuner.py", line 50, in get_best_model
    model = super().get_best_models()[0]
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/kerastuner/engine/tuner.py", line 258, in get_best_models
    return super(Tuner, self).get_best_models(num_models)
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/kerastuner/engine/base_tuner.py", line 240, in get_best_models
    models = [self.load_model(trial) for trial in best_trials]
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/kerastuner/engine/base_tuner.py", line 240, in <listcomp>
    models = [self.load_model(trial) for trial in best_trials]
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/kerastuner/engine/tuner.py", line 183, in load_model
    model.load_weights(self._get_checkpoint_fname(
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 250, in load_weights
    return super(Model, self).load_weights(filepath, by_name, skip_mismatch)
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/tensorflow/python/keras/engine/network.py", line 1231, in load_weights
    py_checkpoint_reader.NewCheckpointReader(filepath)
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 95, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./image_classifier/trial_9d509ff4f26fbf458ebe2930ed9aa1fd/checkpoints/epoch_4/checkpoint: Not found: ./image_classifier/trial_9d509ff4f26fbf458ebe2930ed9aa1fd/checkpoints/epoch_4; No such file or directory

And as you can see:

(ak4) vicastro@dgx1:/raid/vicastro$ ls ./image_classifier/trial_9d509ff4f26fbf458ebe2930ed9aa1fd/checkpoints/
epoch_0   epoch_11  epoch_13  epoch_5  epoch_7  epoch_9
epoch_10  epoch_12  epoch_14  epoch_6  epoch_8

It looks as if it is designed to keep only the last 10 epochs or so, but then tries to load one epoch more than it kept: epochs 4 to 14, both inclusive, make 11 epochs in total.
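To confirm which checkpoint is absent without eyeballing the listing, a small sketch like the following compares the directory names against an expected epoch range. The helper name `missing_epochs` is hypothetical, and the `epoch_<n>` naming pattern is an assumption taken from the `ls` output above; neither is part of AutoKeras or Keras Tuner.

```python
import re


def missing_epochs(dir_names, first, last):
    """Return the epochs in [first, last] that have no 'epoch_<n>' entry."""
    present = set()
    for name in dir_names:
        match = re.fullmatch(r"epoch_(\d+)", name)
        if match:
            present.add(int(match.group(1)))
    return [epoch for epoch in range(first, last + 1) if epoch not in present]


# Directory names copied from the `ls` output above.
listing = [
    "epoch_0", "epoch_5", "epoch_6", "epoch_7", "epoch_8", "epoch_9",
    "epoch_10", "epoch_11", "epoch_12", "epoch_13", "epoch_14",
]

print(missing_epochs(listing, 4, 14))  # [4]
```

On a real run, the list could come from `os.listdir()` on the trial's checkpoints directory instead of being typed in by hand.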

Bug Reproduction

# Data comes from generators built on ImageDataGenerator (not shown); fitting works perfectly.
import autokeras as ak
import tensorflow as tf

# get_train_generator and get_val_generator are the generator functions mentioned above.
train_dataset = tf.data.Dataset.from_generator(get_train_generator, output_types=('uint8', 'uint8'),
    output_shapes=(tf.TensorShape((None, 256, 256, 1)), tf.TensorShape((None, 2))))

val_dataset = tf.data.Dataset.from_generator(get_val_generator, output_types=('uint8', 'uint8'),
    output_shapes=(tf.TensorShape((None, 256, 256, 1)), tf.TensorShape((None, 2))))

clf = ak.ImageClassifier(max_trials=2)
clf.fit(train_dataset, validation_data=val_dataset, epochs=10)
model = clf.export_model()  # raises the ValueError shown above
model.save("./model.h5")

Data used by the code: Greyscale images, can’t provide the dataset since it is not public yet.

Expected Behavior

Fit a model, export and save it.

Setup Details

Include the details about the versions of:

OS type and version:
Python: 3.8.3
autokeras: 1.0.4
keras-tuner: 1.0.2rc1
scikit-learn: 0.23.1
numpy: 1.18.15
pandas: 1.0.5
tensorflow: 2.2.0

Additional context

It also fails with autokeras 1.0.3 and the previous version of keras-tuner. I am confused because 1 month ago I was doing this with mnist and it worked nicely.

Also, if I set epochs=10 it works, but with more epochs I get the error mentioned.
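One way to make sense of this threshold is to simulate a "keep the last N checkpoints" rule. The sketch below is a hypothetical model, not code taken from Keras Tuner: it assumes a checkpoint is written after every epoch and that, once the epoch index exceeds `keep`, the checkpoint from `keep` epochs earlier is deleted. The function name and the default `keep=10` are assumptions chosen to match the observations in this report.

```python
def surviving_checkpoints(num_epochs, keep=10):
    """Simulate an assumed pruning rule and return the epochs left on disk."""
    on_disk = set()
    for epoch in range(num_epochs):
        on_disk.add(epoch)  # a checkpoint is written after each epoch
        if epoch > keep:  # assumed rule, not verified against the library source
            on_disk.discard(epoch - keep)
    return sorted(on_disk)


print(surviving_checkpoints(15))  # [0, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
print(surviving_checkpoints(10))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Under these assumptions, 15 epochs leave exactly the directories shown in the listing above, while 10 epochs delete nothing, which would be consistent with epochs=10 working. If the best-scoring epoch is one of the pruned ones (epoch 4 in the traceback), loading its weights would fail in the way reported.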

Issue Analytics

State:
Created 3 years ago
Reactions:1
Comments:8 (1 by maintainers)

Top GitHub Comments

SivamPillai commented, Sep 4, 2020

I have tested this with v1.0.5 and TF 2.3.0. I can confirm that the issue is completely resolved. Also closed #1210

ChanChiChoi commented, Jul 27, 2020

https://github.com/keras-team/autokeras/issues/1210 will be helpful. For now, a simple workaround is to comment out the delete portion.
