dvc: performance optimization for directories

Context is here:

https://stackoverflow.com/questions/56035696/version-control-for-machine-learning-data-set-with-large-amount-of-images

Our data set is an OCR data set with more than 100000 small images; the total size is about 200 MB. Using dvc to track this data set, we encountered the following problems:

Adding the data set for tracking took a lot of time.
Upload is very slow.
Download is very slow.
Updating, deleting, or adding just one image in the data set causes dvc to recompute a lot of things: hashes, etc.

Issue Analytics

State:
Created 4 years ago
Reactions: 1
Comments: 18 (13 by maintainers)

Top GitHub Comments

5 reactions

pared commented, May 8, 2019

Sample script that seems to reproduce the user's problem:

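#!/bin/bash

# Start from a clean slate: a local "remote" (storage) and a repo.
rm -rf storage repo
mkdir storage repo
mkdir repo/data

# Create 100000 small files to mimic the data set.
for i in {1..100000}
do
  echo "${i}" >> "repo/data/${i}"
done

cd repo

git init
dvc init

# Use the sibling directory as the default remote.
dvc remote add -d storage ../storage

# Track the directory and push it.
dvc add data
dvc commit data.dvc
git add .gitignore data.dvc

git commit -am "init"
dvc push

# Add a single new file, then re-add the directory.
dvc unprotect data
echo update >> data/update
dvc add data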

After adding the update file, md5 computation for the large directory is retriggered.
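One way to picture the effect is a directory-level checksum built from per-file md5 sums: adding one file changes the directory digest, and a naive implementation rehashes every file to get there. The sketch below is an illustration only. `dir_digest` is a hypothetical helper, not DVC's actual implementation, and it assumes GNU `md5sum`, `sort -z`, and `xargs` are available.

```shell
#!/bin/bash
# Hypothetical helper (not part of DVC): digest a directory by hashing
# the sorted list of its per-file md5 sums. Every file is read each time.
dir_digest() {
  (cd "$1" && find . -type f -print0 | sort -z | xargs -0 md5sum) \
    | md5sum | cut -d' ' -f1
}

# Small throwaway directory standing in for the large data set.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
mkdir "$demo/data"
for i in 1 2 3; do echo "$i" > "$demo/data/$i"; done

before=$(dir_digest "$demo/data")

# Same change as in the script above: one new file.
echo update > "$demo/data/update"
after=$(dir_digest "$demo/data")

echo "before: $before"
echo "after:  $after"
```

The two digests differ even though the three original files are untouched, so any scheme of this shape has to either rehash everything or keep a per-file cache of already computed sums.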

3 reactions

Suor commented, Sep 27, 2019

Some takeaways:

the checkout change slows things down significantly
pull/push degraded significantly over time (probably after switching from listings to batch exists checks; this is a local remote, though, so take it with a grain of salt)
multithreaded md5s do not help as much as one might expect
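The last point can be checked outside of DVC with a rough benchmark. The sketch below hashes many tiny files serially and then with several workers, and prints both timings. It is an assumption-laden illustration, not the measurement behind this comment: the file count (2000), batch size (20), and worker count (4) are arbitrary, and it relies on GNU `date +%s%N`, `md5sum`, and `xargs -P`.

```shell
#!/bin/bash
# Illustrative benchmark only; all sizes and counts are arbitrary assumptions.
now_ms() { echo $(( $(date +%s%N) / 1000000 )); }

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/data"
for i in $(seq 1 2000); do echo "$i" > "$work/data/$i"; done

cd "$work/data" || exit 1

# Serial: one md5sum process at a time, 20 files per invocation.
t0=$(now_ms)
find . -type f -print0 | xargs -0 -n 20 md5sum > "$work/serial.txt"
t1=$(now_ms)

# Parallel: same batches, but up to 4 md5sum processes at once.
find . -type f -print0 | xargs -0 -n 20 -P 4 md5sum > "$work/parallel.txt"
t2=$(now_ms)

echo "serial:   $((t1 - t0)) ms"
echo "parallel: $((t2 - t1)) ms"
```

Both runs produce the same set of checksums; only the wall-clock time differs, and the size of that difference will depend on the machine and file system.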

I saved all the output with timestamps, so it can be analyzed to find where we have sleeps and slow ins and outs.
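For anyone who wants to collect similar logs, a minimal sketch of timestamping command output follows. `with_timestamps` is a hypothetical helper introduced here for illustration; it is not necessarily how the output above was captured.

```shell
#!/bin/bash
# Hypothetical helper: run a command and prefix each line of its
# combined stdout/stderr with the current wall-clock time.
with_timestamps() {
  "$@" 2>&1 | while IFS= read -r line; do
    printf '%s %s\n' "$(date +%H:%M:%S)" "$line"
  done
}

# Demonstration with a harmless command. With DVC installed, one could
# run `with_timestamps dvc add data` or `with_timestamps dvc push` instead.
with_timestamps echo "hello"
```

Gaps between consecutive timestamps then show where a command sat idle or spent most of its time.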

Another thing is that this was tested with cache type copy only.
