ModelCheckpoint improvement from "remove & write" to "write & remove"
Lately I experienced an issue with model checkpointing, so I wanted to bring it up for discussion. I am unsure whether this is a "bug", so I opened it as a "question". To sum it up, when using model checkpointing with a certain configuration, where `n_saved=1`, there is a potential risk of losing the checkpoint due to the "remove first, and then write" logic.
Problem
You can create a model checkpoint at specific timestamps to save training state. However, there is a potential risk of losing the checkpoint due to write errors, interruptions, machine shut-downs, outages, etc. The DiskSaver used by ModelCheckpoint first checks whether the checkpoint count exceeds `n_saved`; if it does, it removes the old/older checkpoints and then writes the new checkpoint. When `n_saved=1`, this turns into a basic "remove existing & write", and if the write process is corrupted for any reason, the older checkpoint has already been deleted, so you simply waste your training resources and time. Setting `n_saved > 1` can create many dangling and redundant model checkpoints, and with a large number of experiments, especially with huge models on cloud storage, these unnecessary files claim a large amount of storage space.
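To make the failure mode concrete, here is a simplified sketch of the described "remove first, then write" ordering (an illustration only, not the actual DiskSaver code):

```python
import os

# Simplified illustration of the risky "remove first, then write" ordering
# with n_saved=1 (not the actual DiskSaver implementation): the old checkpoint
# is deleted before the new one is written, so a crash or write error in
# between leaves no checkpoint at all.
def save_remove_then_write(state_bytes: bytes, old_path: str, new_path: str) -> None:
    if os.path.exists(old_path):
        os.remove(old_path)          # previous checkpoint is gone from this point on
    with open(new_path, "wb") as f:  # if this fails, nothing usable remains on disk
        f.write(state_bytes)
```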
Ideas
The trivial idea is to set `n_saved > 1`, but this has some negative consequences that people generally want to avoid. The second idea is to replace the "remove first and then write" logic with "write first and then remove" logic, as sketched below. Are there any other practical ideas?
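A minimal sketch of the proposed ordering (an illustration of the idea, not a patch against the actual DiskSaver code):

```python
import os

# Sketch of the proposed "write first, then remove" ordering: the new
# checkpoint is fully written and flushed before the previous one is deleted,
# so an interrupted write still leaves the old checkpoint intact.
def save_write_then_remove(state_bytes: bytes, old_path: str, new_path: str) -> None:
    with open(new_path, "wb") as f:
        f.write(state_bytes)
        f.flush()
        os.fsync(f.fileno())  # make sure the new checkpoint actually hit the disk
    if os.path.exists(old_path):
        os.remove(old_path)   # only now drop the previous checkpoint
```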
NOTE: This issue was opened to discuss the situation; it is not meant to imply any feature request.
Hi guys, apologies for being so late to the discussion 😃
Actually, starting with version v1.0.0 there is now an API to delete remote files (even on the default HTTP server). This means that ClearML will now be able to support Option 2 as well. I actually like the fact that we avoid overwriting the remote files, because local file corruption can be thought of as relatively rare, while network issues are more common and might create a corrupt copy on the remote storage. I still believe that `n_saved=2` is probably safer if you have an unstable system, but it is always good to have more options 😃

Regarding the local storage policy, there is no way around it: if we want to increase reliability, then at some point you will have two copies of the model. I do not think this is a real issue; if you do not have enough free space for an extra model file, your system will be unstable regardless (temp space these days is a must for many parts of the OS).
Well, with the following, we can simply reverse the order to write/remove and not implement "lazy saving" (atomic would cover this, I believe).
After this change and with `n_saved=1`, there may be no need to implement a lazy save for ClearMLSaver; it would be enough to note the `atomic` option in the docs in a way that makes clear it can be used to prevent any corruption. So the `atomic` option simply becomes more effective, and it is a trivial solution (also for ClearMLSaver). However, for the following I could not come up with any immediate solution yet.
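For reference, the atomic-write pattern that the `atomic` option is expected to cover usually looks like this (a generic sketch assuming a local filesystem and using `torch.save` as the serializer; not copied from the ignite source):

```python
import os
import tempfile

import torch

# Generic atomic-write pattern (a sketch of what an "atomic" save option
# typically does): serialize into a temporary file in the target directory,
# then rename it into place, so a reader never sees a partially written
# checkpoint.
def atomic_save(obj, path: str) -> None:
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".pt.tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            torch.save(obj, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```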