Dependent directory not in dvc
dvc-version: 0.21.2, linux/ubuntu, python: 3.6.5
I have the following folder-structure (for image-classification):
data
|___ raw
| |___ apple
| |___ orange
|___ processed
| |___ train
| | |___ apple
| | |___ orange
| |___ test
| | |___ apple
| | |___ orange
Inside the directories raw/apple and raw/orange are image files. With split_dataset.py these files are copied to the processed directory and randomly split into training and test images. So my data preparation step is:
dvc run -d data/raw -d split_dataset.py -o data/processed -f data.dvc python split_dataset.py
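The script itself is not shown in the issue. As a rough sketch only, a split script of the kind described might look like the following; the function name `split_dataset`, the `test_fraction` and `seed` parameters, and the default split ratio are all assumptions made for illustration, not the author's actual code.

```python
import random
import shutil
from pathlib import Path


def split_dataset(raw_dir, out_dir, test_fraction=0.2, seed=None):
    """Copy every class folder in raw_dir into out_dir/train and out_dir/test.

    Hypothetical stand-in for split_dataset.py; the 20% test share is an
    assumed default, the issue does not state the real ratio.
    """
    rng = random.Random(seed)
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    for class_dir in sorted(p for p in raw_dir.iterdir() if p.is_dir()):
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        rng.shuffle(files)  # random assignment to train/test
        n_test = int(len(files) * test_fraction)
        for i, src in enumerate(files):
            subset = "test" if i < n_test else "train"
            dest = out_dir / subset / class_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest / src.name)  # copy, leave raw data untouched
```

Under these assumptions, calling `split_dataset("data/raw", "data/processed")` would produce the processed tree shown earlier. The dvc run command above is what invokes the real script.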
The output is:
Adding 'data/processed' to 'data/.gitignore'.
Saving 'data/processed' to cache '.dvc/cache'.
Linking directory 'data/processed'.
Saving information to 'data.dvc'.
The content of data.dvc is
cmd: python3 split_dataset.py
deps:
- md5: 749f2c46188b40a30ac18106e466e543.dir
  path: data/raw
- md5: c46da892a15ff5c9425bd3fddfee1a14
  path: split_dataset.py
md5: f5144729533bc4ec9dd36ae3b5218fbd
outs:
- cache: true
  md5: a8c5831d07f8c882460d9567dfcf582b.dir
  path: data/processed
After doing dvc remote and dvc push … output:
Preparing to push data to s3://...
[##############################] 100% Collecting information
(1/28): [##############################] 100% data/processed
(2/28): [##############################] 100% data/processed/train/apple/d0.jpg
(3/28): [##############################] 100% data/processed/train/apple/d9.jpg
(4/28): [##############################] 100% data/processed/train/orange/i05.jpg
(5/28): [##############################] 100% data/processed/test/apple/d4.jpg
(6/28): [##############################] 100% data/processed/train/apple/d1.jpg
(7/28): [##############################] 100% data/processed/train/apple/db.jpg
(8/28): [##############################] 100% data/processed/test/apple/d5.jpg
(9/28): [##############################] 100% data/processed/train/apple/d2.jpg
(10/28): [##############################] 100% data/processed/train/orange/i07.jpg
(11/28): [##############################] 100% data/processed/train/apple/dc.jpg
(12/28): [##############################] 100% data/processed/test/apple/da.jpg
(13/28): [##############################] 100% data/processed/train/apple/d3.jpg
(14/28): [##############################] 100% data/processed/train/orange/i08.jpg
(15/28): [##############################] 100% data/processed/train/orange/i00.jpg
(16/28): [##############################] 100% data/processed/train/orange/i09.jpg
(17/28): [##############################] 100% data/processed/train/apple/d6.jpg
(18/28): [##############################] 100% data/processed/train/orange/i01.jpg
(19/28): [##############################] 100% data/processed/test/orange/i03.jpg
(20/28): [##############################] 100% data/processed/train/orange/i10.jpg
(21/28): [##############################] 100% data/processed/train/apple/d7.jpg
(22/28): [##############################] 100% data/processed/train/orange/i02.jpg
(23/28): [##############################] 100% data/processed/test/orange/i06.jpg
(24/28): [##############################] 100% data/processed/train/orange/i04.jpg
(25/28): [##############################] 100% data/processed/train/apple/d8.jpg
(26/28): [##############################] 100% data/processed/train/orange/s2.jpg
(27/28): [##############################] 100% data/processed/train/orange/s3.jpg
(28/28): [##############################] 100% data/processed/test/orange/s1.jpg
I check out the project on a different machine with git clone and dvc pull. The directory data/raw is not created.
The directory is given in the dependencies (-d) and is listed in the .dvc file, but it is somehow ignored.
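A likely explanation, assuming DVC 0.21.2 behaves the way DVC generally does: dvc push and dvc pull only transfer data that DVC has cached, meaning stage outputs (-o) or data registered with dvc add. A dependency given with -d is only checksummed. The sketch below illustrates that distinction on the stage data from data.dvc; the helper name `deps_without_outs` is hypothetical and not part of DVC, and the stage is written out as a Python dict instead of being parsed from the file.

```python
from pathlib import PurePosixPath


def deps_without_outs(stage):
    """Return dependency paths that no 'outs' entry of the stage covers.

    Hypothetical helper for illustration; it does not call DVC.
    """
    outs = [PurePosixPath(o["path"]) for o in stage.get("outs", [])]
    missing = []
    for dep in stage.get("deps", []):
        dep_path = PurePosixPath(dep["path"])
        # A dependency is covered if it is an output or lies inside one.
        covered = any(dep_path == out or out in dep_path.parents for out in outs)
        if not covered:
            missing.append(dep["path"])
    return missing


# Stage data copied from the data.dvc shown in the issue.
stage = {
    "cmd": "python3 split_dataset.py",
    "deps": [
        {"md5": "749f2c46188b40a30ac18106e466e543.dir", "path": "data/raw"},
        {"md5": "c46da892a15ff5c9425bd3fddfee1a14", "path": "split_dataset.py"},
    ],
    "outs": [
        {"cache": True, "md5": "a8c5831d07f8c882460d9567dfcf582b.dir",
         "path": "data/processed"},
    ],
}

print(deps_without_outs(stage))  # ['data/raw', 'split_dataset.py']
```

Of the two paths reported, split_dataset.py travels with git clone, while data/raw is tracked by neither Git nor DVC as an output. If that is indeed the cause, registering the directory with dvc add data/raw (and committing the resulting .dvc file) before pushing should make it available to dvc pull.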
Issue Analytics
- State:
- Created 5 years ago
- Comments:6 (3 by maintainers)
Hi Ruslan,
my focus is on deep learning and image recognition. Along with that comes a large amount of training images and models that are really large (80MB up to 500MB). My large models and all the image files don’t fit into the corporate Git, so I’m now using DVC to version my models and the corresponding image files, storing the large data in S3 and the code in Git. The combination of code and data, that is, of Git and DVC, was something I had been searching for a long time. And DVC handles this in such an easy and flexible way that I can hardly imagine the hassle I had before.
@stvogel Glad to hear that dvc is so useful in your scenario 🙂 Thanks for all the feedback!