ModelCheckpoint improvement from "remove & write" to "write & remove"
Lately I experienced an issue with model checkpointing, so I wanted to bring it up for discussion. I am unsure whether this is a "bug", so I opened it as a "question". To sum it up, when using model checkpointing with a certain configuration, where `n_saved=1`, there is a potential risk of losing the checkpoint due to the "remove first, and then write" logic.
Problem
You can create a model checkpoint at specific timestamps to save training state. However, there is a potential risk of losing the checkpoint due to write errors, interruptions, machine shut-downs, outages, etc. The DiskSaver used by ModelCheckpoint first checks whether the checkpoint count exceeds `n_saved`; if it does, it removes the old/older checkpoints and then writes the new checkpoint. When `n_saved=1`, this turns into a basic "remove existing & write", and if the write process is corrupted for any reason, the older checkpoint has already been deleted, so you simply waste your training resources and time. Setting `n_saved > 1` can create many dangling and redundant model checkpoints, and with a large number of experiments, especially with huge models on cloud storage, these unnecessary files claim a large amount of storage space.
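To make the failure mode concrete, here is a simplified sketch of the described "remove first, then write" ordering (an illustration only, not the actual DiskSaver code):

```python
import os

# Simplified illustration of the risky "remove first, then write" ordering
# with n_saved=1 (not the actual DiskSaver implementation): the old checkpoint
# is deleted before the new one is written, so a crash or write error in
# between leaves no checkpoint at all.
def save_remove_then_write(state_bytes: bytes, old_path: str, new_path: str) -> None:
    if os.path.exists(old_path):
        os.remove(old_path)          # previous checkpoint is gone from this point on
    with open(new_path, "wb") as f:  # if this fails, nothing usable remains on disk
        f.write(state_bytes)
```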
Ideas
The trivial idea is to set `n_saved > 1`, but this has some negative consequences that people generally want to avoid. The second idea is to replace the "remove first and then write" logic with "write first and then remove" logic, as sketched below. Are there any other practical ideas?
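A minimal sketch of the proposed ordering (an illustration of the idea, not a patch against the actual DiskSaver code):

```python
import os

# Sketch of the proposed "write first, then remove" ordering: the new
# checkpoint is fully written and flushed before the previous one is deleted,
# so an interrupted write still leaves the old checkpoint intact.
def save_write_then_remove(state_bytes: bytes, old_path: str, new_path: str) -> None:
    with open(new_path, "wb") as f:
        f.write(state_bytes)
        f.flush()
        os.fsync(f.fileno())  # make sure the new checkpoint actually hit the disk
    if os.path.exists(old_path):
        os.remove(old_path)   # only now drop the previous checkpoint
```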
NOTE: This issue was opened to discuss the situation; it is not meant to imply any feature request.
Hi guys, apologies for being so late to the discussion 😃
Actually, starting with version v1.0.0 there is now an API to delete remote files (even on the default HTTP server). This means that ClearML will now be able to support Option 2 as well. I actually like the fact that we avoid overwriting the remote files, because local file corruption can be thought of as relatively rare, while network issues are more common and might create a corrupt copy on the remote storage. I still believe that `n_saved=2` is probably safer if you have an unstable system, but it is always good to have more options 😃

Regarding the local storage policy, there is no way around it: if we want to increase reliability, then at some point you will have two copies of the model. I do not think this is a real issue; if you do not have enough free space for an extra model file, your system will be unstable regardless (temp space these days is a must for many parts of the OS).
Well, with the following, we can simply reverse the order to write/remove and not implement "lazy saving" (atomic would cover this, I believe).
After this change and with `n_saved=1`, there may be no need to implement a lazy save for ClearMLSaver; it would be enough to note the `atomic` option in the docs in a way that makes clear it can be used to prevent any corruption. So the `atomic` option simply becomes more effective, and it is a trivial solution (also for ClearMLSaver). However, for the following I could not come up with any immediate solution yet.
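For reference, the atomic-write pattern that the `atomic` option is expected to cover usually looks like this (a generic sketch assuming a local filesystem and using `torch.save` as the serializer; not copied from the ignite source):

```python
import os
import tempfile

import torch

# Generic atomic-write pattern (a sketch of what an "atomic" save option
# typically does): serialize into a temporary file in the target directory,
# then rename it into place, so a reader never sees a partially written
# checkpoint.
def atomic_save(obj, path: str) -> None:
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".pt.tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            torch.save(obj, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```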