`exp run`: `--temp` `--rev` does not properly resume from the target revision


Bug Report

Description

The command appears to resume from whatever is checked out in the workspace, even when a different revision is passed via --rev.

Reproduce

Clone the worked example repository and run the reproduction script:

git clone https://github.com/mattlbeck/dvc-exp-resume-checkpoint-issue
cd dvc-exp-resume-checkpoint-issue
pip install -r requirements.txt
bash ./reproduce_issue.sh

Expected

On the second experiment, it should try to resume counting from 3 and exit immediately. The observed behaviour is that it just starts from 0 again.
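
For illustration, the resume step that misbehaves can be sketched roughly as follows. This is not the content of `reproduce_issue.sh`, which is not shown here; the helper name `resume_cmd` and the revision name `exp-abc123` are hypothetical. The function only prints the command, so it can be inspected before anything is run against a real repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DVC or the linked repository):
# prints the command that is expected to resume a checkpoint experiment
# from the given revision inside a temporary directory.
resume_cmd() {
  local rev="$1"
  printf 'dvc exp run --temp --rev %s\n' "$rev"
}

# Print the command for an assumed checkpoint revision name.
resume_cmd "exp-abc123"
```

According to the report, the run started this way counts from 0 again instead of picking up from the checkpoint named by `--rev`.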

Environment information

DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.10.4 on macOS-12.3.1-x86_64-i386-64bit
Supports:
        webhdfs (fsspec = 2022.2.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.2.0, boto3 = 1.20.24)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 11 (9 by maintainers)

Top GitHub Comments

2 reactions
mattlbeck commented, Jun 8, 2022

One job per instance sounds quite sensible, but I think I would have to deploy with TPI for each experiment I want to run, is that correct? It will be a fair bit more engineering to do it this way - the nice thing about the current workflow was we could let DVC worry about scheduling jobs correctly given a fixed number of parallel jobs.

2 reactions
mattlbeck commented, Jun 7, 2022

Reading the linked conversations, it seems like I have stumbled into a bit of a checkpoint behaviour minefield! I won’t pretend to fully understand all the target use cases, but it sounds like you are trying to accommodate a lot of disparate things with this behaviour. Perhaps I can help most at this point by fully explaining our current use case.

We want to use dvc exp functionality with queue and parallel jobs to run hyperparameter tuning experiments on a remote AWS spot instance. We are looking at using CML/TPI to help us with this as well, and our aim is to set up easy deployment of long hyperparameter tuning jobs that overcome spot interruptions. The workflow we are hoping to achieve is:

  • Add functionality to training jobs to create DVC checkpoints routinely
  • Make a script to --queue various experiments with different parameters, then start them with --run-all --jobs. The script is run from a deployed spot instance.
  • When the spot instance terminates, CML/TPI should be able to re-deploy a new spot instance. They appear to also sync intermediary data between instances, but I am not sure that this is useful in the case of multiple --temp experiments.
  • Instead, we need to pull the latest checkpoint for each job and resume from that experiment revision. This is possible within the same script that we are using to initiate multiple jobs by checking if there are existing experiments at the current git revision.

I think we are finding in particular that the combination of parallel exp jobs, checkpoint, and remote instance training is quite tricky!
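
The queue-then-run part of the workflow above might look roughly like the following dry-run sketch. It only prints the `dvc exp run` commands rather than executing them. The parameter name `train.lr`, its values, and the helper names `queue_cmds` and `run_all_cmd` are assumptions made for illustration; they do not come from the issue.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the DVC commands instead of executing them.
# The parameter name train.lr and its values are illustrative assumptions.

# Print one "queue" command per hyperparameter value passed as an argument.
queue_cmds() {
  local lr
  for lr in "$@"; do
    printf 'dvc exp run --queue -S train.lr=%s\n' "$lr"
  done
}

# Print the command that runs every queued experiment with N parallel jobs.
run_all_cmd() {
  local jobs="$1"
  printf 'dvc exp run --run-all --jobs %s\n' "$jobs"
}

queue_cmds 0.1 0.01 0.001
run_all_cmd 2
```

Resuming after a spot interruption would additionally need the latest checkpoint of each experiment to be pulled and passed via `--rev`, which is the step this issue reports as not working with `--temp`.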

