ImageClassifier missing checkpoint leads to saving model error

Bug Description

After fitting a model with ImageClassifier, I can’t export the model because a checkpoint is missing. I checked the generated image_classifier folder and the checkpoint it is looking for really is not there, so this is not a wrong-path problem.

Bug Error Message

1.0000Found 60 images belonging to 2 classes.
Found 60 images belonging to 2 classes.
30/30 [==============================] - 31s 1s/step - loss: 1.5799e-05 - accuracy: 1.0000 - val_loss: 0.7994 - val_accuracy: 0.8000
Traceback (most recent call last):
  File "covid19.py", line 77, in <module>
    model = clf.export_model()
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/autokeras/auto_model.py", line 454, in export_model
    return self.tuner.get_best_model()
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/autokeras/engine/tuner.py", line 50, in get_best_model
    model = super().get_best_models()[0]
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/kerastuner/engine/tuner.py", line 258, in get_best_models
    return super(Tuner, self).get_best_models(num_models)
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/kerastuner/engine/base_tuner.py", line 240, in get_best_models
    models = [self.load_model(trial) for trial in best_trials]
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/kerastuner/engine/base_tuner.py", line 240, in <listcomp>
    models = [self.load_model(trial) for trial in best_trials]
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/kerastuner/engine/tuner.py", line 183, in load_model
    model.load_weights(self._get_checkpoint_fname(
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 250, in load_weights
    return super(Model, self).load_weights(filepath, by_name, skip_mismatch)
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/tensorflow/python/keras/engine/network.py", line 1231, in load_weights
    py_checkpoint_reader.NewCheckpointReader(filepath)
  File "/mnt/sdd/vicastro/envs/ak4/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 95, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./image_classifier/trial_9d509ff4f26fbf458ebe2930ed9aa1fd/checkpoints/epoch_4/checkpoint: Not found: ./image_classifier/trial_9d509ff4f26fbf458ebe2930ed9aa1fd/checkpoints/epoch_4; No such file or directory

And as you can see:

(ak4) vicastro@dgx1:/raid/vicastro$ ls ./image_classifier/trial_9d509ff4f26fbf458ebe2930ed9aa1fd/checkpoints/
epoch_0   epoch_11  epoch_13  epoch_5  epoch_7  epoch_9
epoch_10  epoch_12  epoch_14  epoch_6  epoch_8
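
Because ls sorts names lexicographically, gaps in the epoch sequence are easy to miss in a listing like this. Below is a minimal sketch of a helper that reads the epoch numbers on disk in numeric order and reports the missing ones. The names list_epochs and missing_epochs are hypothetical, and the epoch_<n> directory naming is assumed from the listing above.

```python
import re
from pathlib import Path

# Directory names of the form epoch_<n>, as seen in the listing (assumed naming).
EPOCH_DIR = re.compile(r"^epoch_(\d+)$")


def list_epochs(checkpoint_dir):
    """Return the epoch numbers that have a directory under checkpoint_dir, sorted numerically."""
    epochs = []
    for entry in Path(checkpoint_dir).iterdir():
        match = EPOCH_DIR.match(entry.name)
        if entry.is_dir() and match:
            epochs.append(int(match.group(1)))
    return sorted(epochs)


def missing_epochs(epochs):
    """Return the epoch numbers absent between 0 and the highest epoch present."""
    if not epochs:
        return []
    present = set(epochs)
    return [e for e in range(max(epochs) + 1) if e not in present]
```

For the directories listed above, missing_epochs returns [1, 2, 3, 4], which includes the epoch_4 named in the error message.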

It looks as if it is designed to keep only the last 10 epochs or something like that, and then tries to load one epoch more than it kept: epochs 4 to 14, both inclusive, are 11 epochs in total.
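
The listing is consistent with a simple retention rule: after saving the checkpoint for epoch e, delete the checkpoint for epoch e - N once e is greater than N. The sketch below simulates that rule. It is an assumption made to illustrate the reasoning, not code taken from keras-tuner, and simulate_retention and keep_last are hypothetical names.

```python
def simulate_retention(num_epochs, keep_last=10):
    """Simulate which epoch checkpoints survive under the assumed rule.

    Assumed rule (an illustration, not taken from keras-tuner's source):
    after saving epoch e, delete epoch e - keep_last if e > keep_last.
    """
    on_disk = set()
    for epoch in range(num_epochs):
        on_disk.add(epoch)  # save this epoch's checkpoint
        if epoch > keep_last:
            # drop the checkpoint that fell out of the retention window
            on_disk.discard(epoch - keep_last)
    return sorted(on_disk)
```

With num_epochs=15 this leaves epochs 0 and 5 to 14, the same 11 directories as in the listing; epochs 1 to 4 are gone, so a best epoch of 4 can no longer be loaded. With num_epochs=10 nothing is deleted, which would explain why a 10-epoch run exports without error.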

Bug Reproduction

# Data comes from generators built on ImageDataGenerator (definitions omitted); fitting works perfectly.
import autokeras as ak
import tensorflow as tf

train_dataset = tf.data.Dataset.from_generator(
    get_train_generator,
    output_types=('uint8', 'uint8'),
    output_shapes=(tf.TensorShape((None, 256, 256, 1)), tf.TensorShape((None, 2))))

val_dataset = tf.data.Dataset.from_generator(
    get_val_generator,
    output_types=('uint8', 'uint8'),
    output_shapes=(tf.TensorShape((None, 256, 256, 1)), tf.TensorShape((None, 2))))

clf = ak.ImageClassifier(max_trials=2)
# The error appears only when epochs is larger than 10 (see Additional context).
clf.fit(train_dataset, validation_data=val_dataset, epochs=10)
model = clf.export_model()  # raises the ValueError shown above
model.save("./model.h5")

Data used by the code: Greyscale images, can’t provide the dataset since it is not public yet.

Expected Behavior

Fit a model, export and save it.

Setup Details

Include the details about the versions of:

  • OS type and version:
  • Python: 3.8.3
  • autokeras: 1.0.4
  • keras-tuner: 1.0.2rc1
  • scikit-learn: 0.23.1
  • numpy: 1.18.15
  • pandas: 1.0.5
  • tensorflow: 2.2.0

Additional context

It also fails with autokeras 1.0.3 and a previous version of keras-tuner. I am confused because a month ago I was doing this with MNIST and it worked nicely.

Also, if I set epochs=10 it works, but with more epochs I have the error mentioned.

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Reactions:1
  • Comments:8 (1 by maintainers)

Top GitHub Comments

1 reaction
SivamPillai commented, Sep 4, 2020

I have tested this with v1.0.5 and TF 2.3.0. I can confirm that the issue is completely resolved. Also closed #1210
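
To see whether an environment already has the versions mentioned in this comment, a small check like the one below can help. It is a sketch only: parse_version, is_at_least and check_installed are hypothetical helper names, and the parser deliberately ignores pre-release suffixes such as rc1.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '1.0.5' or '1.0.2rc1' into a tuple of leading integers, e.g. (1, 0, 5).

    Pre-release suffixes such as 'rc1' are ignored in this simplified parser.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed, minimum):
    """Compare two version strings by their numeric components."""
    return parse_version(installed) >= parse_version(minimum)


def check_installed(package, minimum):
    """Return True or False, or None if the package is not installed."""
    try:
        return is_at_least(version(package), minimum)
    except PackageNotFoundError:
        return None
```

For example, check_installed("autokeras", "1.0.5") and check_installed("tensorflow", "2.3.0") report whether the installed packages are at least the versions the commenter tested.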

1 reaction
ChanChiChoi commented, Jul 27, 2020

https://github.com/keras-team/autokeras/issues/1210
will be helpful. For now, a simple workaround is to comment out the delete portion.
