checkout: use more than one core to check checksum when checking into another branch

I’m working with a few million images that sum up to ~11GB. When I run dvc checkout, dvc checks whether the files have changed and, if they have, computes the new md5. With this many images, that can take more than 15 minutes.

htop shows that only one core is being used during this process. Perhaps we could speed it up with multiprocessing, distributing the file checks across all available cores on the machine.
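The idea above could be sketched roughly as follows. This is not DVC's code: `file_md5` and `parallel_md5` are hypothetical helper names, and the chunk size and batch size are arbitrary assumptions. It only illustrates spreading per-file md5 computation over a pool of worker processes.

```python
import hashlib
from multiprocessing import Pool

CHUNK_SIZE = 1024 * 1024  # assumed value: read files in 1 MiB pieces to bound memory use


def file_md5(path):
    """Return (path, hex md5) for one file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


def parallel_md5(paths, processes=None):
    """Hash many files across worker processes.

    processes=None lets multiprocessing use one worker per available core.
    """
    with Pool(processes=processes) as pool:
        # chunksize batches many small files into one task, which reduces
        # inter-process overhead when there are millions of tiny images.
        return dict(pool.imap_unordered(file_md5, paths, chunksize=256))
```

On platforms that start workers with spawn, `parallel_md5(list_of_paths)` should be called from under an `if __name__ == "__main__":` guard. Whether this actually helps depends on whether the run is CPU-bound or limited by disk reads.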

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

2 reactions
efiop commented, Nov 15, 2018

Hi @IamGianluca !

Great idea! We will need to modify the collect_dir_cache method in https://github.com/iterative/dvc/blob/master/dvc/remote/local.py#L187 for that. Thanks to @mroutis, we also have a selective checkout optimization coming in 0.20.4, which should help, since it won’t remove unchanged files inside directories tracked by dvc. I also suspect that you are running into the state db size limit (10 million), which makes things slower too. We’ve increased it to 100M in 0.20.4, which is coming today. I will notify you when it is ready, so you can give it a try and see if it makes a difference for you. We will look into parallelization soon as well.

Thank you for the feedback!

0 reactions
efiop commented, Jul 23, 2019

Currently checksums for directories are computed in multiple threads, utilizing multiple cores. Closing. Please feel free to reopen if the issue still persists.
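For readers curious what a threaded approach can look like, here is a minimal sketch. It is an illustration only, not DVC's implementation: `md5_of` and `threaded_dir_md5` are hypothetical names. Threads can occupy several cores here because CPython's hashlib releases the GIL while hashing larger buffers, and file reads release it as well.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024 * 1024  # assumed value: 1 MiB reads, large enough for hashlib to drop the GIL


def md5_of(path):
    """Return the hex md5 of one file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def threaded_dir_md5(root, max_workers=None):
    """Map every file under root to its md5, hashing files in a thread pool."""
    paths = sorted(
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
    )
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs each
        # path with its own checksum.
        return dict(zip(paths, pool.map(md5_of, paths)))
```

Calling `threaded_dir_md5("path/to/images")` returns a dict of path to checksum. With very small files the per-file overhead may dominate, so the speedup from threads can be modest.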
