exp: Checkpoints created during `dvc exp run --temp` run are lost after failure (e.g., `kill -9`)


Bug Report

Description

I have a long-running training stage in my `dvc.yaml` that uses DVCLive to track metrics and experiment checkpoints. The PyTorch model `.ckpt` file, created by PyTorch Lightning's `ModelCheckpoint` callback, is marked with `checkpoint: true`. When I run the training with `dvc exp run --temp`, it executes inside a temp folder created in `.dvc/tmp/exps/standalone/`, and all checkpoint Git objects are stored under `.dvc/tmp/exps/standalone/tmpXXX/.git/objects/`. When the training process is interrupted (e.g., OOM, a shared memory issue, or failure to create new threads due to OS limits), DVC reports `ERROR: failed to reproduce 'train': failed to run: ...` and exits. While doing so, it deletes the temp directory in `.dvc/tmp/exps/standalone/` and, with it, all previously created checkpoints. I cannot find the same checkpoint objects in the `.git/objects` folder of the workspace, so I am unable to recover those checkpoints.

Reproduce

  1. Create a `dvc.yaml` with a `train` stage that runs a training script using DVCLive and checkpoints.
  2. Execute the stage with `dvc exp run --temp train`.
  3. Wait a number of epochs until a few checkpoints have been stored.
  4. Kill the training process with `kill -9`.
  5. Check that the `.dvc/tmp/exps/standalone/tmpXXX` folder is gone and that no checkpoints appear in the workspace (e.g., in `dvc exp show`).
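
Step 5 can also be checked from a small script. The following is only a sketch: the directory layout is the one described in this report, while the helper names `list_temp_exp_dirs` and `count_git_objects` are made up for illustration and are not part of DVC.

```python
from pathlib import Path

# Location taken from this report; the tmpXXX names are whatever DVC generated.
STANDALONE_DIR = Path(".dvc") / "tmp" / "exps" / "standalone"


def list_temp_exp_dirs(repo_root="."):
    """Return the temp experiment directories that currently exist, sorted by name."""
    base = Path(repo_root) / STANDALONE_DIR
    if not base.is_dir():
        return []
    return sorted(p for p in base.iterdir() if p.is_dir())


def count_git_objects(temp_dir):
    """Count the files under <temp_dir>/.git/objects (hypothetical helper)."""
    objects = Path(temp_dir) / ".git" / "objects"
    if not objects.is_dir():
        return 0
    return sum(1 for p in objects.rglob("*") if p.is_file())
```

Calling `list_temp_exp_dirs()` from the repository root while training is running, and again after the kill, should show the `tmpXXX` folder appearing and then disappearing if the behavior described above occurs.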

Expected

Checkpoints should be preserved to be able to recover from failures such as the ones mentioned in the description.
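
To make the expectation concrete, here is a minimal model of "rescue the checkpoints before deleting the temp directory". It is not DVC's implementation; `run_in_temp_dir`, `stage`, `preserve_to`, and `pattern` are hypothetical names, and it copies plain files rather than Git objects.

```python
import shutil
import tempfile
from pathlib import Path


def run_in_temp_dir(stage, preserve_to, pattern="*.ckpt"):
    """Run stage(workdir) in a temp directory and rescue checkpoints before cleanup.

    All names here are assumptions made for this sketch, not DVC APIs.
    """
    workdir = Path(tempfile.mkdtemp(prefix="tmp"))
    try:
        stage(workdir)
    finally:
        # Runs whether the stage succeeded or raised: copy checkpoints out
        # first, and only then delete the temp directory.
        dest = Path(preserve_to)
        dest.mkdir(parents=True, exist_ok=True)
        for ckpt in workdir.rglob(pattern):
            shutil.copy2(ckpt, dest / ckpt.name)
        shutil.rmtree(workdir, ignore_errors=True)
```

This only helps when the process doing the cleanup survives, as in this report, where the training process is killed but DVC itself still prints the error and exits. If the cleaning process were itself killed with `kill -9`, no cleanup code would run at all.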

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.31.0 (rpm)
---------------------------------
Platform: Python 3.8.3 on Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-glibc2.14
Subprojects:

Supports:
        azure (adlfs = None, knack = 0.10.0, azure-identity = 1.11.0),
        gdrive (pydrive2 = 1.14.0),
        gs (gcsfs = None),
        hdfs (fsspec = None, pyarrow = 9.0.0),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = None, boto3 = 1.24.59),
        ssh (sshfs = 2022.6.0),
        webdav (webdav4 = 0.9.7),
        webdavs (webdav4 = 0.9.7),
        webhdfs (fsspec = None)
Cache types: hardlink, symlink
Cache directory: xfs on /dev/md124
Caches: local, s3
Remotes: s3, s3
Workspace directory: xfs on /dev/md124
Repo: dvc (subdir), git

Additional Information (if any):

When the experiment is interrupted with CTRL+C, the training script is set up to still return a zero exit code, so DVC considers the experiment successfully executed. In this case, I expect the checkpoints to be preserved before the temp directory is deleted (but I haven’t tested this yet).
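
The training script itself is not shown in this report. One common way to get a zero exit code on CTRL+C is sketched below; `run_with_clean_interrupt` and `train` are hypothetical names.

```python
import sys


def run_with_clean_interrupt(train):
    """Call train(); treat Ctrl+C as a normal early stop and return exit code 0."""
    try:
        train()
    except KeyboardInterrupt:
        # Ctrl+C raises KeyboardInterrupt in the main thread; swallow it so the
        # caller sees a successful run.
        print("Interrupted, stopping early", file=sys.stderr)
    return 0
```

A script would end with something like `sys.exit(run_with_clean_interrupt(train))`. Any other exception still propagates and produces a non-zero exit code, and a `kill -9` cannot be caught this way at all.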

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 15 (5 by maintainers)

Top GitHub Comments

dberenbaum commented, Nov 29, 2022 (2 reactions)

I can reproduce with --temp but not with --queue. Can we make --temp behave like --queue?

pmrowla commented, Dec 8, 2022 (1 reaction)

Failed queued experiments are now shown as failed in the table and through the `exp queue` commands, but we are not saving any Git commits for those failed experiments. You just get a row showing which run failed, and you can now use `queue logs` to see the error logs explaining why it failed.

But this only applies to experiments run with `--queue`.
