`exp run`: `--temp` `--rev` does not properly resume from the target revision

Bug Report

Description

The command appears to resume from whatever is checked out in the workspace, even when a different --rev is provided.

Reproduce

Clone the worked example repository and run the reproduction script:

git clone https://github.com/mattlbeck/dvc-exp-resume-checkpoint-issue
cd dvc-exp-resume-checkpoint-issue
pip install -r requirements.txt
bash ./reproduce_issue.sh

Expected

On the second experiment, it should try to resume counting from 3 and exit immediately. The observed behaviour is that it just starts from 0 again.

Environment information

DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.10.4 on macOS-12.3.1-x86_64-i386-64bit
Supports:
        webhdfs (fsspec = 2022.2.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.2.0, boto3 = 1.20.24)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git

Top GitHub Comments

mattlbeck commented, Jun 8, 2022

One job per instance sounds quite sensible, but I think I would have to deploy with TPI for each experiment I want to run, is that correct? It will be a fair bit more engineering to do it this way - the nice thing about the current workflow was we could let DVC worry about scheduling jobs correctly given a fixed number of parallel jobs.

mattlbeck commented, Jun 7, 2022

Reading the linked conversations, it seems like I have stumbled into a bit of a checkpoint behaviour minefield! I won’t pretend to fully understand all the target use cases, but it sounds like you are trying to accommodate a lot of disparate things with this behaviour. Perhaps I can help most at this point by fully explaining our current use case.

We want to use dvc exp functionality with queue and parallel jobs to run hyperparameter tuning experiments on a remote AWS spot instance. We are looking at using CML/TPI to help us with this as well, and our aim is to set up easy deployment of long hyperparameter tuning jobs that overcome spot interruptions. The workflow we are hoping to achieve is:

1. Add functionality to training jobs to create DVC checkpoints routinely.
2. Make a script to --queue various experiments with different parameters, then --run-all --jobs. The script is run from a deployed spot instance.
3. When the spot instance terminates, CML/TPI should be able to re-deploy a new spot instance. They appear to also sync intermediary data between instances, but I am not sure that this is useful in the case of multiple --temp experiments.
4. Instead, we need to pull the latest checkpoint for each job and resume from that experiment revision. This is possible within the same script that we are using to initiate multiple jobs by checking if there are existing experiments at the current git revision.
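
The workflow above might be sketched roughly as follows. This is only an illustration, not the script from the issue: the function names and the parameter grid are hypothetical, and the sketch only builds the dvc command lines rather than running them. How the latest checkpoint revision for each job is found is left out, so `resume_command` simply takes that revision as an argument. The resume command has the same `--temp` `--rev` shape as the one this bug report is about.

```python
import itertools
import shlex


def queue_commands(param_grid):
    """Build one `dvc exp run --queue` command per parameter combination.

    `param_grid` is a hypothetical dict mapping parameter names to lists
    of values to try.
    """
    names = sorted(param_grid)
    commands = []
    for values in itertools.product(*(param_grid[name] for name in names)):
        argv = ["dvc", "exp", "run", "--queue"]
        for name, value in zip(names, values):
            argv += ["--set-param", f"{name}={value}"]
        commands.append(argv)
    return commands


def run_all_command(jobs):
    """Command that runs every queued experiment with `jobs` parallel workers."""
    return ["dvc", "exp", "run", "--run-all", "--jobs", str(jobs)]


def resume_command(rev):
    """Command that resumes a checkpoint experiment in a temp dir from `rev`.

    `rev` is assumed to be an experiment revision found elsewhere.
    """
    return ["dvc", "exp", "run", "--temp", "--rev", rev]


if __name__ == "__main__":
    # Hypothetical parameter names; real ones depend on the project's params.yaml.
    grid = {"train.lr": [0.01, 0.1], "train.batch_size": [32]}
    for argv in queue_commands(grid) + [run_all_command(2)]:
        print(shlex.join(argv))
```

Each returned list can be passed to `subprocess.run(argv, check=True)` on the spot instance. Keeping command construction separate from execution makes it possible to check the script without DVC installed.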

I think we are finding in particular that the combination of parallel exp jobs, checkpoint, and remote instance training is quite tricky!
