dvc: performance optimization for directories
See original GitHub issue. Context is here:
Our data set is an OCR data set with more than 100,000 small images, about 200 MB in total. Using DVC to track this data set, we encountered the following problems:
It takes a long time to add the data set for tracking.
Very slow upload.
Very slow download.
Updating, deleting, or adding just one image in the data set causes DVC to recompute a lot of things (hashes, etc.).
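Much of the reported slowness is consistent with per-file overhead (stat, open, hash setup) dominating when a directory holds very many small files. Below is a minimal sketch, not DVC's code, that reproduces the pattern at a smaller scale; the file count and sizes are made-up stand-ins for the OCR images:

```python
import hashlib
import os
import tempfile
import time


def make_dataset(root, n_files, size=2048):
    """Create n_files small files (a scaled-down stand-in for the OCR images)."""
    for i in range(n_files):
        with open(os.path.join(root, f"img_{i:06d}.png"), "wb") as f:
            f.write(os.urandom(size))


def hash_directory(root):
    """Hash every file individually, the way a per-file tracker must."""
    hashes = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        with open(path, "rb") as f:
            hashes[name] = hashlib.md5(f.read()).hexdigest()
    return hashes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_dataset(root, 1000)
        start = time.perf_counter()
        hashes = hash_directory(root)
        elapsed = time.perf_counter() - start
        print(f"hashed {len(hashes)} files in {elapsed:.2f}s")
```

Scaling this to 100,000 files makes it clear the cost is dominated by the number of files rather than the 200 MB of data itself.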
Issue Analytics
- State:
- Created 4 years ago
- Reactions: 1
- Comments: 18 (13 by maintainers)
Sample script that seems to reproduce the user's problem:
After adding an update, the MD5 computation for the large directory is retriggered.
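The retriggered MD5 computation can be avoided by caching per-file hashes keyed on (size, mtime), so that a one-file change only rehashes that file. DVC keeps a state database for this purpose; the sketch below illustrates the idea only and is not DVC's implementation:

```python
import hashlib
import os


def file_md5(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def hash_directory_incremental(root, state):
    """Re-hash only files whose (size, mtime) changed since the last run.

    `state` maps path -> (size, mtime_ns, md5); it plays the role of a
    persistent state cache. Returns (hashes, number_of_files_rehashed).
    """
    hashes = {}
    rehashed = 0
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        st = os.stat(path)
        key = (st.st_size, st.st_mtime_ns)
        cached = state.get(path)
        if cached and cached[:2] == key:
            # Unchanged according to size/mtime: reuse the stored hash.
            hashes[name] = cached[2]
        else:
            # New or modified file: hash it and refresh the cache entry.
            hashes[name] = file_md5(path)
            state[path] = (st.st_size, st.st_mtime_ns, hashes[name])
            rehashed += 1
    return hashes, rehashed
```

With a warm cache, updating one image out of 100,000 costs one hash instead of 100,000, which is exactly the behavior the report above says is missing.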
Some takeaways:
I saved all the output with timestamps, so it can be analyzed to find where we have sleeps and slow I/O.
Another thing is that this was tested with cache type copy only.
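Since only the copy cache type was tested, the numbers may look different with link-based cache types, which avoid duplicating file contents between the workspace and the cache. A hedged example of switching the cache type (syntax as in DVC's config documentation; exact option support depends on the filesystem):

```shell
# Prefer reflinks, fall back to hardlinks/symlinks, and copy only as a last resort.
dvc config cache.type reflink,hardlink,symlink,copy

# Re-link already checked-out files so the new cache type takes effect.
dvc checkout --relink
```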