DVC cannot track files when the script is aborted...
Bug Report
Description
DVC doesn’t track files when the script is aborted via CTRL+C.
Motivation: Training scripts can take a long time, and we often don’t know when they will finish. Hence, we might need to stop a script manually and still want to track its files, especially log files. When I stop the script manually, DVC doesn’t track them.
Reproduce
Let’s assume that the following code snippet is our training script.
import os
import time

os.makedirs("log", exist_ok=True)
f = open("log/demofile.txt","w")
for i in range(200):
    time.sleep(1)
    f.write("train step: " + str(i) + " \n")
    f.flush()
f.close()
dvc init
To add the stage configuration, I run the following command:
dvc run --force -n dummy_traning -d train.py -o log python train.py
After some time, I stop the script, check the log folder, and see the following lines:
train step: 0
train step: 1
train step: 2
Unfortunately, DVC did not generate a dvc.yaml file and doesn’t track the data.
To work around this, I can either create the dvc.yaml file manually, or comment out the time.sleep(1) line in train.py and run the same command again:
dvc run --force -n dummy_traning -d train.py -o log python train.py
Let’s assume that I commented out the time.sleep(1) line and reran the command. As the script finished successfully, DVC was able to generate the dvc.yaml and dvc.lock files.
dvc run --force -n dummy_traning -d train.py -o log python train.py
Running stage 'dummy_traning':
> python train.py
Modifying stage 'dummy_traning' in 'dvc.yaml'
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.yaml dvc.lock
To enable auto staging, run:
dvc config core.autostage true
From this point, I can add these files to git and run different DVC experiments.
git add .
git commit -m "Add DVC Configuration"
Now I would like to run different experiments and track the data.
Let’s assume that my first experiment takes around 2 seconds. In other words, I replace time.sleep(1) with time.sleep(0.01) and run the experiment:
import os
import time

os.makedirs("log", exist_ok=True)
f = open("log/demofile.txt","w")
for i in range(200):
    time.sleep(0.01)
    f.write("train step: " + str(i) + " \n")
    f.flush()
f.close()
dvc exp run -n erdi_test1
Terminal output
Running stage 'dummy_traning':
> python train.py
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.yaml dvc.lock train.py
To enable auto staging, run:
dvc config core.autostage true
Ran experiment(s): erdi_test1
Experiment results have been applied to your workspace.
To promote an experiment to a Git branch run:
dvc exp branch <exp> <branch>
We can successfully track the data from the first experiment.
Let’s assume that my second experiment takes around 200 seconds. In other words, I replace time.sleep(0.01) with time.sleep(1) and run the experiment. After some time, I stop the experiment.
import os
import time

os.makedirs("log", exist_ok=True)
f = open("log/demofile.txt","w")
for i in range(200):
    time.sleep(1)
    f.write("train step: " + str(i) + " \n")
    f.flush()
f.close()
dvc exp run -n erdi_test2
Terminal output (I stop the script manually with CTRL+C)
Running stage 'dummy_traning':
> python train.py
^CTraceback (most recent call last):
  File "train.py", line 7, in <module>
    time.sleep(1)
KeyboardInterrupt
ERROR: failed to reproduce 'dummy_traning': failed to run: python train.py, exited with -2
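The -2 in the error message is presumably a signal-based exit status: on POSIX systems, Python's subprocess module reports a child terminated by signal N as return code -N, and SIGINT (what CTRL+C sends) is signal 2. The following is a minimal sketch of that convention only. It does not use DVC, the helper name exit_code_after_sigint is hypothetical, and whether DVC obtains the status this way is an assumption.

```python
import signal
import subprocess
import sys


def exit_code_after_sigint():
    """Start a child Python process that terminates itself with SIGINT.

    Hypothetical helper for illustration (POSIX only).
    """
    child_code = (
        "import os, signal\n"
        # Restore the default SIGINT action so the signal terminates the process.
        "signal.signal(signal.SIGINT, signal.SIG_DFL)\n"
        "os.kill(os.getpid(), signal.SIGINT)\n"
    )
    result = subprocess.run([sys.executable, "-c", child_code])
    # subprocess reports "terminated by signal N" as the return code -N.
    return result.returncode
```

On Linux, the returned value equals -signal.SIGINT, which matches the number shown in the error above.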
Unfortunately, DVC cannot track the files from this experiment, but I still have the following log file generated by the second experiment:
train step: 0
train step: 1
train step: 2
Expected
I expected to be able to track these files. As a user, I would like to track files even when the script is stopped manually. I believe this is a major bug in DVC.
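One possible way to approximate this behaviour from the script side, sketched here under assumptions, is to let the training script treat CTRL+C as a normal stop: catch KeyboardInterrupt, close the log file, and return so the process exits with status 0. The error above suggests DVC fails the stage because of the non-zero exit status. Assuming that is the only check, and that DVC itself keeps running after the CTRL+C, the stage might then be recorded as completed. This has not been verified against DVC 2.18.0. The function name train and its parameters are hypothetical and are not part of the original script.

```python
import os
import time


def train(steps=200, delay=1.0, log_dir="log", sleep=time.sleep):
    """Write one log line per step and stop cleanly on CTRL+C.

    Returns the number of completed steps. The `sleep` parameter exists
    only so the interruption can be simulated without a real signal.
    """
    os.makedirs(log_dir, exist_ok=True)
    completed = 0
    # The context manager closes the file even if the loop is interrupted.
    with open(os.path.join(log_dir, "demofile.txt"), "w") as f:
        try:
            for i in range(steps):
                sleep(delay)
                f.write("train step: " + str(i) + " \n")
                f.flush()
                completed += 1
        except KeyboardInterrupt:
            # Assumption: returning normally (exit status 0) lets DVC
            # treat the stage as finished instead of failed.
            print("Interrupted after", completed, "steps; keeping partial log.")
    return completed
```

In train.py, the function would be called at the bottom of the file, for example train() with the defaults. Note the trade-off: a run stopped this way is indistinguishable, by exit status alone, from one that completed every step.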
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.18.0 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.15.0-43-generic-x86_64-with-glibc2.29
Supports:
gdrive (pydrive2 = 1.14.0),
webhdfs (fsspec = 2022.7.1),
http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda2
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git
Additional Information (if any):
Issue Analytics
- State:
- Created a year ago
- Comments: 14 (1 by maintainers)
Top GitHub Comments
Can we close this one?
The missing version info message usually means some DVC data was not pushed properly.
In this case, the given experiment was created based on commit a0a5744, but your current checked out HEAD is 33dd781. exp apply will only apply experiments onto the commit the experiment was originally run against.
This should work as expected:
It may be easier to visualize all of this using dvc exp show instead of dvc exp list. If you run dvc exp show -A to see all of your (locally available/pulled) experiments, it should show that your current git HEAD (33dd781) has no experiments based on that commit. But commit a0a5744 will contain the 3 erdi_test experiments.