checkout: use more than one core to check checksum when checking into another branch
I’m working with a few million images that add up to ~11GB. When I run `dvc checkout`, dvc checks whether files have changed and, if they have, computes the new md5. This can take more than 15 minutes when working with lots of images. `htop` shows that only one core is used during this process. Perhaps we could speed it up by using multiprocessing to distribute the checks across all available cores in the machine.
Issue Analytics
- State:
- Created 5 years ago
- Reactions:1
- Comments:8 (6 by maintainers)
Hi @IamGianluca!
Great idea! We will need to modify the collect_dir_cache method in https://github.com/iterative/dvc/blob/master/dvc/remote/local.py#L187 for that. Thanks to @mroutis, we also have a selective checkout optimization coming in 0.20.4, which should help, since it won’t remove unchanged files inside directories tracked by dvc. I also suspect that you are running into the state db size limit (10 million), which makes things slower as well. We’ve increased it to 100M in 0.20.4, which is coming today. I will notify you when it is ready, so you can give it a try and see if it makes a difference for you. We will look into parallelization soon as well.
Thank you for the feedback!
Currently checksums for directories are computed in multiple threads, utilizing multiple cores. Closing. Please feel free to reopen if the issue still persists.
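For readers curious what hashing a directory in multiple threads can look like, here is a minimal sketch of the general technique. It is not DVC's actual implementation: `file_md5`, `dir_md5s`, and `CHUNK_SIZE` are hypothetical names chosen for illustration. It relies on the fact that Python's `hashlib` releases the GIL while hashing large buffers, so worker threads can keep several cores busy.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024 * 1024  # hypothetical value: read in 1 MiB chunks to bound memory use


def file_md5(path):
    """Return the hex MD5 digest of a single file, read chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dir_md5s(root, max_workers=None):
    """Map each file under `root` (by relative path) to its MD5, hashing in parallel."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `paths`.
        digests = pool.map(file_md5, paths)
        return {os.path.relpath(p, root): d for p, d in zip(paths, digests)}
```

Calling `dir_md5s("data/images")` (a placeholder path) would return a dict of relative path to digest. With millions of very small files, per-file overhead rather than hashing may dominate, so the speedup should be measured rather than assumed.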