DVC cannot track files when the script is aborted...


Bug Report

Description

DVC doesn’t track files when the script is aborted via CTRL+C.

Motivation: Training scripts can take a long time, and we don’t always know when they will finish. We may therefore need to stop a script manually and still want to track its files, especially log files. When I stop the script manually, DVC doesn’t track them.

Reproduce

Let’s assume that the following code snippet is our training script.

import os
import time
os.makedirs("log", exist_ok=True)
f = open("log/demofile.txt","w")

for i in range(200):
    time.sleep(1)
    f.write("train step: " + str(i) + " \n")
    f.flush()

f.close()

dvc init

To add the stage configuration, I run the following command:

dvc run --force -n dummy_traning -d train.py -o log python train.py

After some time, I stop the script, check the log folder, and see the following lines:

train step: 0 
train step: 1 
train step: 2

Unfortunately, DVC does not generate a dvc.yaml file and doesn’t track the data.

To work around this, I can create the dvc.yaml file manually, or comment out time.sleep(1) in the train.py script and run dvc run --force -n dummy_traning -d train.py -o log python train.py again.
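A manually written dvc.yaml for this stage might look like the sketch below. It mirrors the flags passed to dvc run above (-n, -d, -o and the command) in the stages/cmd/deps/outs layout that DVC uses for dvc.yaml. The helper name write_stage_file is hypothetical and only here for illustration, and the thread does not confirm how DVC then treats the partial log directory.

```python
from pathlib import Path

# Stage definition mirroring the `dvc run` flags used in this report:
# -n dummy_traning, -d train.py, -o log, command `python train.py`.
DVC_YAML = """\
stages:
  dummy_traning:
    cmd: python train.py
    deps:
    - train.py
    outs:
    - log
"""


def write_stage_file(directory="."):
    """Write dvc.yaml into `directory` (hypothetical helper).

    Refuses to overwrite an existing file so that a hand-edited
    pipeline is never clobbered.
    """
    path = Path(directory) / "dvc.yaml"
    if path.exists():
        raise FileExistsError(f"{path} already exists; edit it by hand instead")
    path.write_text(DVC_YAML)
    return path
```

Writing the file by hand only declares the stage. dvc commit is the command DVC documents for recording changed outputs, but its behaviour for an interrupted stage is not verified here.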

Let’s assume that I commented out the time.sleep(1) line and ran dvc run --force -n dummy_traning -d train.py -o log python train.py again. Because the script finishes successfully, DVC is able to generate the dvc.yaml and dvc.lock files.

dvc run --force -n dummy_traning -d train.py -o log python train.py
Running stage 'dummy_traning':
> python train.py
Modifying stage 'dummy_traning' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock

To enable auto staging, run:

	dvc config core.autostage true

From this point, I can add these files to Git and run different DVC experiments.

git add .
git commit -m "Add DVC Configuration"

Now I would like to run different experiments and track the data.

Let’s assume that my first experiment takes around 2 seconds. In other words, I replace time.sleep(1) with time.sleep(0.01) and run the experiment:

import os
import time
os.makedirs("log", exist_ok=True)
f = open("log/demofile.txt","w")

for i in range(200):
    time.sleep(0.01)
    f.write("train step: " + str(i) + " \n")
    f.flush()

f.close()

dvc exp run -n erdi_test1

Terminal output

Running stage 'dummy_traning':
> python train.py
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock train.py

To enable auto staging, run:

	dvc config core.autostage true

Ran experiment(s): erdi_test1
Experiment results have been applied to your workspace.

To promote an experiment to a Git branch run:

	dvc exp branch <exp> <branch>

We can successfully track the data from experiment 1.

Let’s assume that my second experiment takes around 200 seconds. In other words, I replace time.sleep(0.01) with time.sleep(1) and run the experiment. After some time, I stop the experiment.

import os
import time
os.makedirs("log", exist_ok=True)
f = open("log/demofile.txt","w")

for i in range(200):
    time.sleep(1)
    f.write("train step: " + str(i) + " \n")
    f.flush()

f.close()

dvc exp run -n erdi_test2

Terminal output (I stop the script manually with CTRL+C):

Running stage 'dummy_traning':
> python train.py
^CTraceback (most recent call last):
  File "train.py", line 7, in <module>
    time.sleep(1)
KeyboardInterrupt
ERROR: failed to reproduce 'dummy_traning': failed to run: python train.py, exited with -2
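The -2 in this message can be decoded with a short sketch. It assumes DVC reports the value the way Python's subprocess module does, where a negative return code -N means the child was terminated by signal N; the thread itself does not state this. The function name describe_exit is hypothetical.

```python
import signal


def describe_exit(returncode):
    """Explain a subprocess-style return code in words (hypothetical helper)."""
    if returncode < 0:
        # Convention of Python's subprocess module: -N means the child
        # process was terminated by signal number N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Unknown signal number on this platform.
            name = f"signal {-returncode}"
        return f"terminated by {name}"
    if returncode == 0:
        return "finished successfully"
    return f"failed with exit status {returncode}"
```

Under that assumption, describe_exit(-2) reports a termination by SIGINT, the signal sent by CTRL+C, so DVC sees a failed command rather than a finished one.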

Unfortunately, DVC does not track the files from this experiment, but I still have the following log file generated by experiment 2:

train step: 0 
train step: 1 
train step: 2
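One possible workaround, not confirmed anywhere in this thread, is to catch KeyboardInterrupt inside train.py so that the script closes its log and returns normally instead of ending with the -2 status reported above. The sketch below assumes that a zero exit status is enough for DVC to treat the stage as finished. The train function and its sleep parameter are hypothetical names introduced for illustration; sleep is injectable only so the interruption can be simulated.

```python
import os
import time


def train(steps=200, delay=1.0, log_dir="log", sleep=time.sleep):
    """Write one log line per step and return the number of completed steps.

    Hypothetical variant of the train.py loop from this report.
    """
    os.makedirs(log_dir, exist_ok=True)
    completed = 0
    # The context manager closes (and flushes) the file on every exit path.
    with open(os.path.join(log_dir, "demofile.txt"), "w") as f:
        try:
            for i in range(steps):
                sleep(delay)
                f.write("train step: " + str(i) + " \n")
                f.flush()
                completed += 1
        except KeyboardInterrupt:
            # Assumption: a partial log is an acceptable result, so the
            # interrupt is swallowed and the function returns normally.
            pass
    return completed
```

With train() called at the bottom of train.py, pressing CTRL+C would end the loop early and the script would finish without a traceback. Whether recording a partial run as a completed stage is desirable depends on the project.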

Expected

I expected to be able to track these files. As a user, I would like to track files even when the script is stopped manually. I believe this is a major bug in DVC.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.18.0 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.15.0-43-generic-x86_64-with-glibc2.29
Supports:
	gdrive (pydrive2 = 1.14.0),
	webhdfs (fsspec = 2022.7.1),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda2
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git

Additional Information (if any):


Top GitHub Comments

daavoo commented, Aug 17, 2022

Can we close this one?

pmrowla commented, Aug 16, 2022

The missing version info message usually means some DVC data was not pushed properly.

ERROR: 'erdi_test1' does not appear to be an experiment commit.: Experiment derived from 'a0a5744', expected '33dd781'.

In this case, the given experiment was created based on commit a0a5744 but your current checked out HEAD is 33dd781. exp apply will only apply experiments onto the commit the experiment was originally run against.

This should work as expected:

git checkout a0a5744
dvc exp apply erdi_test1
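The rule described above can be sketched as a small helper that reads the two revisions out of the error message and suggests the matching commands. This is only an illustration: the function name suggest_checkout is hypothetical, and the pattern is based solely on the single message quoted in this thread, so other DVC versions may word it differently.

```python
import re

# Matches the two revisions in the message quoted in this thread.
# Straight and typographic quotes are both accepted, since copies of the
# message differ in which ones they use.
_MISMATCH = re.compile(
    r"Experiment derived from ['\u2018\u2019]([0-9a-f]+)['\u2018\u2019], "
    r"expected ['\u2018\u2019]([0-9a-f]+)['\u2018\u2019]"
)


def suggest_checkout(error_message, exp_name):
    """Return the commands that apply `exp_name` on its baseline commit.

    Returns None when the message is not a baseline mismatch.
    """
    match = _MISMATCH.search(error_message)
    if match is None:
        return None
    # First group: commit the experiment was run against.
    # Second group: the currently checked out HEAD.
    baseline, _current_head = match.groups()
    return [f"git checkout {baseline}", f"dvc exp apply {exp_name}"]
```

For the message in this thread, the helper returns the same two commands shown above, because a0a5744 is the commit the experiment was derived from and 33dd781 is the current HEAD.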

It may be easier to visualize all of this using dvc exp show instead of dvc exp list. If you run dvc exp show -A to see all of your (locally available/pulled) experiments, it should show that your current git HEAD (33dd781) has no experiments based on that commit. But commit a0a5744 will contain the 3 erdi_test experiments.