`exp run`: `--temp` `--rev` does not properly resume from the target revision
Bug Report
Description
The command appears to resume from whatever is checked out in the workspace, even when a different `--rev` is provided.
Reproduce
Clone the worked example repository and run the reproduction script:
git clone https://github.com/mattlbeck/dvc-exp-resume-checkpoint-issue
cd dvc-exp-resume-checkpoint-issue
pip install -r requirements.txt
bash ./reproduce_issue.sh
Expected
On the second experiment, it should try to resume counting from 3 and exit immediately. The observed behaviour is that it just starts from 0 again.
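To make the expected behaviour concrete, here is a minimal sketch of a resumable counter. It is a hypothetical stand-in, not the script from the linked repository: the `run` function, the `count.json` state file and the limit of 3 are assumptions chosen to match the description above. A run that finds a saved count of 3 should have nothing left to do, while a run that does not see the checkpoint starts from 0 again.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path

LIMIT = 3  # assumed stopping point, taken from "resume counting from 3"


def run(state_file: Path, limit: int = LIMIT) -> list[int]:
    """Count up to `limit`, resuming from the count saved in `state_file`.

    Returns the counts produced during this call only.
    """
    # Resume from the saved count if a checkpoint exists, otherwise start at 0.
    count = json.loads(state_file.read_text())["count"] if state_file.exists() else 0
    produced = []
    while count < limit:
        count += 1
        # Persist the count after every step (a stand-in for a checkpoint).
        state_file.write_text(json.dumps({"count": count}))
        produced.append(count)
    return produced


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "count.json"
        print(run(state))  # first run counts from 0: [1, 2, 3]
        print(run(state))  # resumed run finds count == 3 and exits: []
        state.unlink()     # checkpoint not visible, as in the reported bug
        print(run(state))  # starts from 0 again: [1, 2, 3]
```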
Environment information
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.10.4 on macOS-12.3.1-x86_64-i386-64bit
Supports:
webhdfs (fsspec = 2022.2.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.2.0, boto3 = 1.20.24)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git
Issue Analytics
- State:
- Created a year ago
- Reactions:1
- Comments:11 (9 by maintainers)
Top GitHub Comments
One job per instance sounds quite sensible, but I think I would have to deploy with TPI for each experiment I want to run, is that correct? It will be a fair bit more engineering to do it this way - the nice thing about the current workflow was we could let DVC worry about scheduling jobs correctly given a fixed number of parallel jobs.
Reading the linked conversations, it seems like I have stumbled into a bit of a checkpoint behaviour minefield! I won’t pretend to fully understand all the target use cases, but it sounds like you are trying to accommodate a lot of disparate things with this behaviour. Perhaps I can help most at this point by fully explaining our current use case.
We want to use `dvc exp` functionality with `queue` and parallel jobs to run hyperparameter tuning experiments on a remote AWS spot instance. We are looking at using CML/TPI to help us with this as well, and our aim is to set up easy deployment of long hyperparameter tuning jobs that overcome spot interruptions. The workflow we are hoping to achieve is: `--queue` various experiments with different parameters, then `--run-all` `--jobs`. The script is run from a deployed spot instance `--temp` experiments.
I think we are finding in particular that the combination of parallel exp jobs, checkpoint, and remote instance training is quite tricky!
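The queue-then-run-all part of this workflow could be scripted roughly as follows. This is only a sketch under assumptions: the `build_commands` helper and the `train.lr` parameter name are hypothetical, and the commands are printed rather than executed. It relies only on the `--queue`, `--set-param`, `--run-all` and `--jobs` options of `dvc exp run`; how `--temp` fits in is not shown, since the comment does not say how it is combined with the rest.

```python
from __future__ import annotations

import shlex


def build_commands(param_sets: list[dict], jobs: int) -> list[list[str]]:
    """Build one `dvc exp run --queue` command per parameter set,
    followed by a single `--run-all` command using `jobs` parallel workers."""
    commands = []
    for params in param_sets:
        cmd = ["dvc", "exp", "run", "--queue"]
        for name, value in params.items():
            cmd += ["--set-param", f"{name}={value}"]
        commands.append(cmd)
    commands.append(["dvc", "exp", "run", "--run-all", "--jobs", str(jobs)])
    return commands


if __name__ == "__main__":
    # `train.lr` is a made-up parameter name for illustration.
    for cmd in build_commands([{"train.lr": 0.01}, {"train.lr": 0.1}], jobs=2):
        print(shlex.join(cmd))
```

Each printed line could be passed to `subprocess.run` on the spot instance once the parameter names match the project's `params.yaml`.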