ModelCheckpoint files not closed

System information. tf_env.txt

  • Custom code (see below)
  • MacOS 12.3.1
  • TensorFlow installed via pip
  • TensorFlow version: 2.9.1
  • Python version: 3.8.11
  • GPU model and memory: 8-Core Intel Core i9, 2.3 GHz, 16 GB RAM

Describe the problem. When I load weights into a model and use tf.keras.callbacks.ModelCheckpoint to store the weights after each epoch of training, the checkpoint files remain open. After many repetitions I eventually run out of resources (too many open files). (My dataset is large and split into batches.)
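To see how close a process is to the "too many open files" error, one option is to compare its open descriptor count with the soft limit. This is a minimal sketch, not part of the original report: the helper name `fd_headroom` is hypothetical, and it assumes a Unix system (macOS or Linux) where `/dev/fd` or `/proc/self/fd` lists the process's descriptors.

```python
import os
import resource  # Unix only (macOS, Linux)


def fd_headroom():
    """Return (open_descriptors, soft_limit) for the current process."""
    # RLIMIT_NOFILE is the per-process cap on open file descriptors.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Assumption: one of these directories exists and lists our descriptors.
    fd_dir = '/proc/self/fd' if os.path.isdir('/proc/self/fd') else '/dev/fd'
    used = len(os.listdir(fd_dir))
    return used, soft


used, soft = fd_headroom()
print(f'{used} descriptors open, soft limit {soft}')
```

Calling this after each training round would show whether the count climbs steadily toward the limit.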

Describe the current behavior. After training is completed (using model.fit), the checkpoint files are never closed.

Describe the expected behavior. After model.fit has completed, the checkpoint files should be closed.

Contributing.

  • Do you want to contribute a PR? (yes/no): no

Standalone code to reproduce the issue. This snippet prints all checkpoint files that are still open after model.fit has completed.

import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Sequential
import psutil

print('tf version: ', tf.version.VERSION)

def train(model, X, y):
    model_dir = './test'
    model_path = './test/cp.ckpt'
    # Save the weights to model_path at the end of every epoch.
    checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=model_path, save_weights_only=True, verbose=0)
    try:
        # latest_checkpoint returns None when no checkpoint exists yet.
        latest = tf.train.latest_checkpoint(model_dir)
        model.load_weights(latest)
    except AttributeError:
        print('First run - not reading checkpoint')
    history = model.fit(X, y, batch_size=32, epochs=10, verbose=0, callbacks=[checkpoint_callback])
    return history


# Positional arguments are (data, index, columns), so each frame holds a single value.
X = pd.DataFrame([1],[2],[3])
y = pd.DataFrame([1],[2],[3])
model = Sequential(
    [
        Input(shape=(1,)),
        Dense(1)
    ]
)
model.compile(loss='binary_crossentropy', metrics=['accuracy'])

# Each round reloads the latest checkpoint and then trains again.
for _ in range(5):
    history = train(model, X, y)

# List every file this process still holds open.
print('Open files:')
proc = psutil.Process()
for file in proc.open_files():
    print(file.path)

Output.

Open files:
/test/cp.ckpt.index
/test/cp.ckpt.data-00000-of-00001
/test/cp.ckpt.index
/test/cp.ckpt.data-00000-of-00001
/test/cp.ckpt.index
/test/cp.ckpt.data-00000-of-00001
/test/cp.ckpt.index
/test/cp.ckpt.data-00000-of-00001
/test/cp.ckpt.index
/test/cp.ckpt.data-00000-of-00001
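A listing like the one above is easier to read when duplicate paths are grouped. The following is a small sketch under stated assumptions: `summarize_open_files` and its `marker` parameter are hypothetical names, and the input list simply copies the paths from the output above.

```python
from collections import Counter


def summarize_open_files(paths, marker='.ckpt'):
    """Count how many open handles exist per checkpoint path."""
    return Counter(p for p in paths if marker in p)


# Paths copied from the reported output (five training rounds).
reported = [
    '/test/cp.ckpt.index',
    '/test/cp.ckpt.data-00000-of-00001',
] * 5

summary = summarize_open_files(reported)
for path, count in summary.items():
    print(f'{count} x {path}')

# With psutil installed, the live list could be passed in instead:
#   summarize_open_files(f.path for f in psutil.Process().open_files())
```

Here each of the two files appears five times, which matches one extra open handle per training round.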

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

2 reactions
hertschuh commented, Oct 14, 2022

@jonasrundberg After a bit of investigation, it turns out that the open files come from model.load_weights(latest) and not ModelCheckpoint.

I’ll keep investigating why this happens.
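One way to attribute leaked descriptors to a single call, as this comment does, is to count the process's open descriptors before and after each step. The sketch below is an illustration only and not the maintainer's method: `open_fd_count` and `fd_delta` are hypothetical helpers, it assumes a Unix system where `/dev/fd` or `/proc/self/fd` is available, and it uses a plain file in place of TensorFlow so that it runs on its own.

```python
import os
import tempfile
from contextlib import contextmanager


def open_fd_count():
    """Number of descriptors currently open in this process (Unix only)."""
    fd_dir = '/proc/self/fd' if os.path.isdir('/proc/self/fd') else '/dev/fd'
    return len(os.listdir(fd_dir))


@contextmanager
def fd_delta(results, label):
    """Record under `label` how many descriptors the wrapped step left open."""
    before = open_fd_count()
    try:
        yield
    finally:
        results[label] = open_fd_count() - before


results = {}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cp.ckpt.index')
    with open(path, 'w') as f:
        f.write('x')

    with fd_delta(results, 'closed properly'):
        with open(path) as f:
            f.read()

    with fd_delta(results, 'left open'):
        leaked = open(path)  # deliberately not closed inside the step

    leaked.close()

print(results)
```

In the reproduction script, the same context manager could wrap `model.load_weights(latest)` and `model.fit(...)` separately to see which step's count stays above zero.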

0 reactions
jonasrundberg commented, Dec 9, 2022

Thanks!
