DVC cannot track files when the script is aborted...

Bug Report

Description

DVC doesn’t track files when the script is aborted via CTRL+C.

Motivation: Training scripts can take a long time, and we often don’t know when they will finish. Hence, we might need to stop a script manually and still want to track its files, especially log files. When I stop the script manually, DVC doesn’t track them.

Reproduce

Let’s assume that the following code snippet is our training script.

import os
import time
os.makedirs("log", exist_ok=True)
f = open("log/demofile.txt","w")

for i in range(200):
    time.sleep(1)
    f.write("train step: " + str(i) + " \n")
    f.flush()

f.close()

dvc init

To add the stage configuration, I run the following command:

dvc run --force -n dummy_traning -d train.py -o log python train.py

After some time, I stop the script, check the log folder, and see the following lines:

train step: 0 
train step: 1 
train step: 2 

However, DVC does not generate the dvc.yaml file and does not track the data.

To work around this, I can either create the dvc.yaml file manually, or comment out time.sleep(1) in train.py and run dvc run --force -n dummy_traning -d train.py -o log python train.py again.
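
For the first workaround, a minimal sketch of writing the stage file by hand could look like the following. The stage name, command, dependency, and output are taken from the dvc run command above; the helper name write_stage_file is hypothetical, and the YAML layout is assumed to match what DVC 2.x generates for such a stage.

```python
from pathlib import Path

# Stage definition mirroring the command used in this report:
#   dvc run -n dummy_traning -d train.py -o log python train.py
# Assumption: this is the layout DVC 2.x expects in dvc.yaml.
DVC_YAML = """\
stages:
  dummy_traning:
    cmd: python train.py
    deps:
    - train.py
    outs:
    - log
"""


def write_stage_file(directory="."):
    """Write dvc.yaml into `directory` and return its path (hypothetical helper)."""
    path = Path(directory) / "dvc.yaml"
    path.write_text(DVC_YAML)
    return path
```

Calling write_stage_file() from the project root would create dvc.yaml there. The file only declares the stage; dvc.lock still has to be produced by DVC itself, and how DVC treats a partially written log folder at that point has not been verified here.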

Let’s assume that I commented out the line time.sleep(1) and ran dvc run --force -n dummy_traning -d train.py -o log python train.py. Because the script finished successfully, DVC was able to generate the dvc.yaml and dvc.lock files.

dvc run --force -n dummy_traning -d train.py -o log python train.py
Running stage 'dummy_traning':                                        
> python train.py
Modifying stage 'dummy_traning' in 'dvc.yaml'                                                                                                                                                                                                                                              
Updating lock file 'dvc.lock'                                                                                                                                                                                                                                                              

To track the changes with git, run:

	git add dvc.yaml dvc.lock

To enable auto staging, run:

	dvc config core.autostage true

From this point, I can add these files to Git and run different DVC experiments.

git add .
git commit -m "Add DVC Configuration"

Now, I would like to run different experiments and track the data.

Let’s assume that my first experiment takes around 2 seconds. In other words, I replace time.sleep(1) with time.sleep(0.01) and run the experiment.

import os
import time
os.makedirs("log", exist_ok=True)
f = open("log/demofile.txt","w")

for i in range(200):
    time.sleep(0.01)
    f.write("train step: " + str(i) + " \n")
    f.flush()

f.close()

dvc exp run -n erdi_test1

Terminal output

Running stage 'dummy_traning':                                                                                                                                                                                                                                                             
> python train.py                                                                                                                                                                                                                                                                          
Updating lock file 'dvc.lock'                                                                                                                                                                                                                                                              

To track the changes with git, run:

	git add dvc.yaml dvc.lock train.py

To enable auto staging, run:

	dvc config core.autostage true
                                                                      
Ran experiment(s): erdi_test1
Experiment results have been applied to your workspace.

To promote an experiment to a Git branch run:

	dvc exp branch <exp> <branch>

We can successfully track the data from the first experiment.

Let’s assume that my second experiment takes around 200 seconds. In other words, I replace time.sleep(0.01) with time.sleep(1) and run the experiment. After some time, I stop the experiment.

import os
import time
os.makedirs("log", exist_ok=True)
f = open("log/demofile.txt","w")

for i in range(200):
    time.sleep(1)
    f.write("train step: " + str(i) + " \n")
    f.flush()

f.close()

dvc exp run -n erdi_test2

Terminal output (I stop the script manually with CTRL+C)

Running stage 'dummy_traning':                                                                                                                                                                                                                                                             
> python train.py                                                                                                                                                                                                                                                                          
^CTraceback (most recent call last):
  File "train.py", line 7, in <module>
    time.sleep(1)
KeyboardInterrupt
ERROR: failed to reproduce 'dummy_traning': failed to run: python train.py, exited with -2
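
The exit status -2 in the error above is consistent with the script having been terminated by SIGINT (signal number 2 on Linux): Python's subprocess module reports a child killed by signal N as return code -N on POSIX. Whether DVC obtains the status this way is an assumption; the sketch below only illustrates the convention, using a hypothetical helper run_interrupted_child.

```python
import signal
import subprocess
import sys

# Child program: restore the default SIGINT action, then send SIGINT to itself,
# which is roughly what happens to a script interrupted with CTRL+C.
CHILD = (
    "import os, signal, time\n"
    "signal.signal(signal.SIGINT, signal.SIG_DFL)\n"
    "os.kill(os.getpid(), signal.SIGINT)\n"
    "time.sleep(5)\n"
)


def run_interrupted_child():
    """Run the child and return its exit status as seen by the parent (POSIX only)."""
    return subprocess.run([sys.executable, "-c", CHILD]).returncode
```

On Linux, comparing run_interrupted_child() with -signal.SIGINT shows that a nonzero, negative status is what the parent sees, which is why the stage is reported as failed.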

Unfortunately, DVC does not track the files from this experiment, even though I still have the following log file generated by the second experiment:

train step: 0 
train step: 1 
train step: 2 

Expected

I expected to be able to track these files. As a user, I would like to track files even when the script is stopped manually. I believe this is a major bug in DVC.
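
Until that is possible, one option might be to let the training script absorb CTRL+C itself, so that it closes the log and exits normally. The following is a minimal sketch of that idea, not a confirmed fix: it assumes that DVC treats an exit code of 0 as a successful stage and that the interrupt does not also stop DVC, neither of which is verified in this report. The train function and its sleep parameter are hypothetical names introduced for illustration.

```python
import os
import time


def train(steps=200, delay=1.0, log_path="log/demofile.txt", sleep=time.sleep):
    """Write one log line per step and return the number of completed steps.

    A KeyboardInterrupt (CTRL+C) ends the loop early instead of crashing,
    so the caller can still exit with status 0.
    """
    log_dir = os.path.dirname(log_path)
    if log_dir:
        os.makedirs(log_dir, exist_ok=True)
    done = 0
    with open(log_path, "w") as f:
        try:
            for i in range(steps):
                sleep(delay)
                f.write("train step: " + str(i) + " \n")
                f.flush()
                done += 1
        except KeyboardInterrupt:
            # Stop training but fall through, so the file is closed cleanly.
            pass
    return done
```

With train() called at the bottom of train.py, an interrupted run would leave the partial log in place and return normally. The trade-off is that an aborted run then looks the same as a completed one to whatever launched the script.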

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.18.0 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.15.0-43-generic-x86_64-with-glibc2.29
Supports:
	gdrive (pydrive2 = 1.14.0),
	webhdfs (fsspec = 2022.7.1),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda2
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git

Additional Information (if any):

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 14 (1 by maintainers)

Top GitHub Comments

1 reaction
daavoo commented, Aug 17, 2022

Can we close this one?

1 reaction
pmrowla commented, Aug 16, 2022

The missing version info message usually means some DVC data was not pushed properly.

ERROR: ‘erdi_test1’ does not appear to be an experiment commit.: Experiment derived from ‘a0a5744’, expected ‘33dd781’.

In this case, the given experiment was created based on commit a0a5744 but your current checked out HEAD is 33dd781. exp apply will only apply experiments onto the commit the experiment was originally run against.

This should work as expected:

git checkout a0a5744
dvc exp apply erdi_test1

It may be easier to visualize all of this using dvc exp show instead of dvc exp list. If you run dvc exp show -A to see all of your (locally available/pulled) experiments, it should show that your current git HEAD (33dd781) has no experiments based on that commit. But commit a0a5744 will contain the 3 erdi_test experiments.
