Dependent directory not in dvc


dvc-version: 0.21.2 linux/ubuntu python: 3.6.5

I have the following folder-structure (for image-classification):

data
|___ raw
|      |___ apple
|      |___ orange
|___ processed
|      |___ train
|      |     |___ apple
|      |     |___ orange
|      |___ test
|      |     |___ apple
|      |     |___ orange

The directories raw/apple and raw/orange contain image files. split_dataset.py copies these files into the processed directory and randomly splits them into training and test images. So my data preparation step is:

dvc run -d data/raw -d split_dataset.py -o data/processed -f data.dvc python split_dataset.py

The output is:

Adding 'data/processed' to 'data/.gitignore'.
Saving 'data/processed' to cache '.dvc/cache'.
Linking directory 'data/processed'.
Saving information to 'data.dvc'.
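split_dataset.py itself is not shown in this issue. As a rough illustration only, a script of this kind might look like the sketch below. The function name, the test_fraction and seed parameters, and the 80/20 split are assumptions made for the example, not details taken from the original script.

```python
import random
import shutil
from pathlib import Path


def split_dataset(raw_dir, processed_dir, test_fraction=0.2, seed=0):
    """Copy every class folder of raw_dir into train/ and test/ below processed_dir.

    Hypothetical helper: test_fraction and seed are illustrative defaults.
    Returns the number of files copied into each subset.
    """
    rng = random.Random(seed)  # fixed seed so that reruns produce the same split
    raw_dir, processed_dir = Path(raw_dir), Path(processed_dir)
    counts = {"train": 0, "test": 0}
    for class_dir in sorted(p for p in raw_dir.iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        n_test = int(len(files) * test_fraction)
        for i, src in enumerate(files):
            # The first n_test shuffled files go to test, the rest to train.
            subset = "test" if i < n_test else "train"
            dest = processed_dir / subset / class_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest / src.name)
            counts[subset] += 1
    return counts


# Only runs when the folder layout from this issue is present.
if __name__ == "__main__" and Path("data/raw").is_dir():
    print(split_dataset("data/raw", "data/processed"))
```

The point of the sketch is that the script only reads from data/raw and only writes to data/processed, which matches the -d and -o arguments of the dvc run command above.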

The content of data.dvc is

cmd: python3 split_dataset.py
deps:
- md5: 749f2c46188b40a30ac18106e466e543.dir
  path: data/raw
- md5: c46da892a15ff5c9425bd3fddfee1a14
  path: split_dataset.py
md5: f5144729533bc4ec9dd36ae3b5218fbd
outs:
- cache: true
  md5: a8c5831d07f8c882460d9567dfcf582b.dir
  path: data/processed

After running dvc remote and dvc push … the output is:

Preparing to push data to s3://...
[##############################] 100% Collecting information
(1/28): [##############################] 100% data/processed
(2/28): [##############################] 100% data/processed/train/apple/d0.jpg
(3/28): [##############################] 100% data/processed/train/apple/d9.jpg
(4/28): [##############################] 100% data/processed/train/orange/i05.jpg
(5/28): [##############################] 100% data/processed/test/apple/d4.jpg
(6/28): [##############################] 100% data/processed/train/apple/d1.jpg
(7/28): [##############################] 100% data/processed/train/apple/db.jpg
(8/28): [##############################] 100% data/processed/test/apple/d5.jpg
(9/28): [##############################] 100% data/processed/train/apple/d2.jpg
(10/28): [##############################] 100% data/processed/train/orange/i07.jpg
(11/28): [##############################] 100% data/processed/train/apple/dc.jpg
(12/28): [##############################] 100% data/processed/test/apple/da.jpg
(13/28): [##############################] 100% data/processed/train/apple/d3.jpg
(14/28): [##############################] 100% data/processed/train/orange/i08.jpg
(15/28): [##############################] 100% data/processed/train/orange/i00.jpg
(16/28): [##############################] 100% data/processed/train/orange/i09.jpg
(17/28): [##############################] 100% data/processed/train/apple/d6.jpg
(18/28): [##############################] 100% data/processed/train/orange/i01.jpg
(19/28): [##############################] 100% data/processed/test/orange/i03.jpg
(20/28): [##############################] 100% data/processed/train/orange/i10.jpg
(21/28): [##############################] 100% data/processed/train/apple/d7.jpg
(22/28): [##############################] 100% data/processed/train/orange/i02.jpg
(23/28): [##############################] 100% data/processed/test/orange/i06.jpg
(24/28): [##############################] 100% data/processed/train/orange/i04.jpg
(25/28): [##############################] 100% data/processed/train/apple/d8.jpg
(26/28): [##############################] 100% data/processed/train/orange/s2.jpg
(27/28): [##############################] 100% data/processed/train/orange/s3.jpg
(28/28): [##############################] 100% data/processed/test/orange/s1.jpg

I check out the project on a different machine with git clone and dvc pull. The directory data/raw is not created.

The directory is given as a dependency (-d) and is listed in the .dvc file, but it is somehow ignored.
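One way to read this behaviour, under the assumption (not confirmed anywhere in the text above) that dvc push and dvc pull only transfer paths listed under outs in some stage file: data/raw appears only under deps, so it would never reach the remote. The sketch below transcribes data.dvc by hand into a Python dict and lists the dependencies that no stage declares as an output. The helper name untracked_deps is hypothetical and not part of DVC.

```python
def untracked_deps(stage, tracked_outputs=()):
    """Return dependency paths that are not declared as an output anywhere.

    stage: a stage file parsed into a dict (transcribed by hand here).
    tracked_outputs: output paths of other stage files, if there are any.
    """
    tracked = set(tracked_outputs) | {out["path"] for out in stage.get("outs", [])}
    return [dep["path"] for dep in stage.get("deps", []) if dep["path"] not in tracked]


# Contents of data.dvc from this issue, copied into a dict.
data_dvc = {
    "cmd": "python3 split_dataset.py",
    "deps": [
        {"md5": "749f2c46188b40a30ac18106e466e543.dir", "path": "data/raw"},
        {"md5": "c46da892a15ff5c9425bd3fddfee1a14", "path": "split_dataset.py"},
    ],
    "outs": [
        {"cache": True, "md5": "a8c5831d07f8c882460d9567dfcf582b.dir", "path": "data/processed"},
    ],
}

print(untracked_deps(data_dvc))  # prints ['data/raw', 'split_dataset.py']
```

split_dataset.py is presumably versioned by Git, which leaves data/raw as the only path that neither Git nor DVC stores. If the assumption holds, tracking the raw images themselves, for example with dvc add data/raw followed by another dvc push, should make them available to dvc pull on the other machine.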

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
stvogel commented, Dec 1, 2018

Hi Ruslan,

My focus is on deep learning and image recognition. That comes with a large number of training images and really large models (80 MB up to 500 MB). My large models and all the image files don't fit into the corporate Git, so I'm now using DVC to version my model and the corresponding image files, storing the large data in S3 and the code in Git. This combination of code and data, in Git and DVC respectively, is something I had been searching for for a long time. DVC handles it in such an easy and flexible way that I can hardly imagine the hassle I had before.

0 reactions
efiop commented, Dec 1, 2018

@stvogel Glad to hear that dvc is so useful in your scenario 🙂 Thanks for all the feedback!
