dvc: performance optimization for directories

Context is here:

https://stackoverflow.com/questions/56035696/version-control-for-machine-learning-data-set-with-large-amount-of-images

Our data set is an OCR data set with more than 100,000 small images; its total size is about 200 MB. While using dvc to track this data set, we ran into the following problems:

  • Adding the data set for tracking took a long time.
  • Upload is very slow.
  • Download is very slow.
  • Updating, deleting, or adding just one image in the data set causes dvc to recompute a lot of things (hashes, etc.).

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 18 (13 by maintainers)

Top GitHub Comments

5 reactions
pared commented, May 8, 2019

Sample script that seems to reproduce the user's problem:

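#!/bin/bash

# Start from a clean slate: a local "remote" (storage) and a fresh repo.
rm -rf storage repo
mkdir storage repo
mkdir repo/data

# Create 100000 small files to imitate the data set.
for i in {1..100000}
do
  echo "${i}" >> "repo/data/${i}"
done

cd repo

git init
dvc init

# Use the sibling directory as the default remote.
dvc remote add -d storage ../storage

# Track the directory and record it in git.
dvc add data
dvc commit data.dvc
git add .gitignore data.dvc

git commit -am "init"
dvc push

# Add one new file to the tracked directory and re-add it.
dvc unprotect data
echo update >> data/update
dvc add data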

After adding the update file, the md5 computation for the whole large directory is retriggered.
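To illustrate what avoiding that recomputation could look like, here is a minimal sketch that keeps a per-file md5 manifest and rehashes only files modified after the manifest was last written. It is not how dvc does it; `update_manifest` and the manifest file are hypothetical names, and the sketch assumes GNU `find`, `md5sum`, and `awk`, paths without newlines or backslashes, and a manifest stored outside the hashed directory.

```shell
#!/bin/bash
# Hypothetical helper, not part of dvc: maintain "<md5>  <path>" lines in a
# manifest and rehash only files that changed since the manifest was written.
update_manifest() {
  local dir=$1 manifest=$2 fresh
  fresh=$(mktemp)
  if [ -f "$manifest" ]; then
    # Incremental run: hash only files modified after the manifest.
    find "$dir" -type f -newer "$manifest" -exec md5sum {} + > "$fresh"
  else
    # First run: every file has to be hashed once.
    find "$dir" -type f -exec md5sum {} + > "$fresh"
    : > "$manifest"
  fi
  # Merge: fresh hashes replace stale entries for the same path.
  # An md5sum line is 32 hex digits, two separator characters, then the path.
  awk 'FILENAME == ARGV[1] { seen[substr($0, 35)] = 1; print; next }
       !(substr($0, 35) in seen)' "$fresh" "$manifest" | sort -k 2 > "$manifest.tmp"
  mv "$manifest.tmp" "$manifest"
  # Print how many files were actually rehashed.
  wc -l < "$fresh" | tr -d ' '
  rm -f "$fresh"
}
```

Calling `update_manifest repo/data manifest.md5` twice should hash everything the first time and only the changed files the second time. The sketch relies on modification times, so a file changed within the same timestamp tick as the manifest can be missed, and entries for deleted files are not removed.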

3 reactions
Suor commented, Sep 27, 2019

Some takeaways:

  • the checkout change slows things down significantly
  • pull/push performance degraded significantly over time (probably with the switch from listings to batch exists checks; this is a local remote, though, so take it with a grain of salt)
  • multithreaded md5s do not help as much as one might expect
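For anyone who wants to try the last point outside of dvc, here is a rough sketch that hashes a directory with a configurable number of parallel `md5sum` processes. It uses processes via `xargs -P` rather than threads, so it only approximates the multithreaded case; `hash_dir` is a hypothetical name, and the sketch assumes GNU coreutils and a non-empty directory.

```shell
#!/bin/bash
# Hypothetical helper, not part of dvc: print sorted "<md5>  <path>" lines
# for every file under a directory, using up to $2 md5sum processes.
hash_dir() {
  local dir=$1 jobs=${2:-1} parts
  parts=$(mktemp -d)
  # Each batch of up to 256 files is hashed by its own md5sum process.
  # Every batch writes to a separate file so parallel output cannot interleave.
  find "$dir" -type f -print0 |
    xargs -0 -n 256 -P "$jobs" sh -c 'md5sum "$@" > "$(mktemp "$0/part.XXXXXX")"' "$parts"
  cat "$parts"/part.* | sort -k 2
  rm -rf "$parts"
}
```

Comparing `time hash_dir repo/data 1 > /dev/null` with `time hash_dir repo/data 4 > /dev/null` on the 100000-file directory from the script above gives a rough feel for how much parallel hashing buys on a given machine.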

I saved all the output with timestamps, so it can be analyzed to see where we have sleeps and slow ins and outs.
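One simple way to capture output like that is sketched below. It is not necessarily the setup used for these measurements; `with_timestamps` is a hypothetical helper that prefixes every line a command prints with the current wall-clock time, at one-second resolution.

```shell
#!/bin/bash
# Hypothetical helper: run a command and prefix each line of its output
# (stdout and stderr) with a timestamp. The command's exit status is lost.
with_timestamps() {
  "$@" 2>&1 | while IFS= read -r line; do
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
  done
}
```

For example, `with_timestamps dvc push > push.log` leaves a log in which long gaps between consecutive lines point at the slow steps. Spawning `date` for every line adds some overhead of its own, so this is only suitable for coarse timing.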

Another thing is that this was tested with cache type copy only.
