exp: Checkpoints created during `dvc exp run --temp` run are lost after failure (e.g., `kill -9`)
Bug Report
Description
I have a long-running training stage in my `dvc.yaml`, which uses DVCLive to track metrics and experiment checkpoints by specifying `checkpoint: true` for the PyTorch model `.ckpt` file created by PyTorch Lightning's `ModelCheckpoint` callback. When the training is executed with `dvc exp run --temp`, it runs inside a temp folder created in `.dvc/tmp/exps/standalone/`. All checkpoint Git objects are stored under `.dvc/tmp/exps/standalone/tmpXXX/.git/objects/`. When the training process is interrupted (e.g., OOM, shared memory issue, failure to create new threads due to OS limits), DVC reports the error `ERROR: failed to reproduce 'train': failed to run: ...` and exits. While doing so, it deletes the temp directory in `.dvc/tmp/exps/standalone/` and, along with it, all previously created checkpoints. I cannot find the same checkpoint objects in the `.git/objects` folder of the workspace and am unable to recover those checkpoints.
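To make the failure mode concrete, here is a rough Python sketch of what I believe is happening. It is not DVC's code: `run_in_temp_dir`, `fail_at`, and `rescue_dir` are hypothetical names, and plain `.ckpt` files stand in for the checkpoint Git objects. The run writes checkpoints into a temp directory, fails, and the cleanup removes everything unless the checkpoints are copied out first.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def run_in_temp_dir(base: Path, epochs: int, fail_at: int,
                    rescue_dir: Optional[Path] = None) -> None:
    """Simulate a run that writes one checkpoint per epoch in a temp dir.

    Hypothetical illustration only; this is not how DVC is implemented.
    """
    tmp = Path(tempfile.mkdtemp(prefix="tmp", dir=base))
    try:
        for epoch in range(epochs):
            if epoch == fail_at:
                raise RuntimeError("simulated training failure")
            (tmp / f"epoch{epoch}.ckpt").write_text(f"weights after epoch {epoch}")
    except RuntimeError as exc:
        # The runner reports the failure and carries on to cleanup.
        print(f"ERROR: failed to run: {exc}")
    finally:
        if rescue_dir is not None:
            # Assumed fix: copy checkpoints out before the temp dir is removed.
            rescue_dir.mkdir(parents=True, exist_ok=True)
            for ckpt in tmp.glob("*.ckpt"):
                shutil.copy2(ckpt, rescue_dir / ckpt.name)
        # Without a rescue step, everything written so far is lost here.
        shutil.rmtree(tmp)
```

Calling it without `rescue_dir` leaves nothing behind after the failure, which matches what I observe; passing a `rescue_dir` keeps the checkpoints written before the failing epoch.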
Reproduce
- Create a `dvc.yaml` with `train` stage running a training script using DVCLive and checkpoints.
- Execute stage with `dvc exp run --temp train`.
- Wait a number of epochs until a few checkpoints were stored.
- Kill training process with `kill -9`.
- Check that `.dvc/tmp/exps/standalone/tmpXXX` folder is gone. No checkpoint objects in workspace (e.g., `dvc exp show`).
Expected
Checkpoints should be preserved to be able to recover from failures such as the ones mentioned in the description.
Environment information
Output of `dvc doctor`:
$ dvc doctor
DVC version: 2.31.0 (rpm)
---------------------------------
Platform: Python 3.8.3 on Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-glibc2.14
Subprojects:
Supports:
azure (adlfs = None, knack = 0.10.0, azure-identity = 1.11.0),
gdrive (pydrive2 = 1.14.0),
gs (gcsfs = None),
hdfs (fsspec = None, pyarrow = 9.0.0),
http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
oss (ossfs = 2021.8.0),
s3 (s3fs = None, boto3 = 1.24.59),
ssh (sshfs = 2022.6.0),
webdav (webdav4 = 0.9.7),
webdavs (webdav4 = 0.9.7),
webhdfs (fsspec = None)
Cache types: hardlink, symlink
Cache directory: xfs on /dev/md124
Caches: local, s3
Remotes: s3, s3
Workspace directory: xfs on /dev/md124
Repo: dvc (subdir), git
Additional Information (if any):
When interrupting the experiment with CTRL+C, the training script is set up to still return a zero exit code so that DVC considers the experiment successfully executed. In this case, I expect the checkpoints to be preserved before the temp directory is deleted (but I haven’t tested this yet).
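For reference, the CTRL+C handling looks roughly like the sketch below. The names `train`, `main`, and the `on_epoch` hook are placeholders for illustration, not my actual training code: the script catches `KeyboardInterrupt` and still reports exit code 0, while any other error propagates and produces a non-zero exit.

```python
import sys


def train(num_epochs, on_epoch=None):
    """Placeholder training loop; `on_epoch` is a hypothetical per-epoch hook."""
    for epoch in range(num_epochs):
        if on_epoch is not None:
            on_epoch(epoch)


def main(num_epochs=3, on_epoch=None):
    """Return the process exit code for the training run."""
    try:
        train(num_epochs, on_epoch)
    except KeyboardInterrupt:
        # Treat a manual interrupt as a normal early stop.
        print("Interrupted; stopping early.", file=sys.stderr)
        return 0
    return 0

# In the real script the entry point would end with: sys.exit(main())
```

Only the manual interrupt is mapped to a zero exit code; a crash such as an OOM error is not caught here, so it still surfaces as a failed stage.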
Issue Analytics
- Created 10 months ago
- Comments: 15 (5 by maintainers)
Top GitHub Comments
I can reproduce with `--temp` but not with `--queue`. Can we make `--temp` behave like `--queue`?
Failed queued experiments are now shown as failed in the table and through the `exp queue` commands, but we are not saving any Git commits for those failed exps; you just get a row showing which run failed (and you can now use `queue logs` to see the error logs as to why it failed).
But this only applies to `--queue`’d experiments.