Parallelize computing md5 for large files
After executing `dvc run`, I have been staring at the "Computing md5 for a large file" message for the past five hours. However, only a single CPU and about 20% of the SSD's maximum read speed are utilized, so a roughly 5× speed-up should be achievable just by computing MD5 in parallel.
We may have to get rid of the per-file progress bars, but I think a single overall progress bar would be more informative anyway, especially with hundreds of files being added to the cache. What do you think?
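For illustration, something along these lines — a minimal sketch, not a proposed implementation (`file_md5` and `md5_many` are hypothetical names) — hashes files in a process pool and reports a single overall progress counter:

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor, as_completed

def file_md5(path, blocksize=1024 * 1024):
    """Stream one file through MD5 in fixed-size blocks to keep memory flat."""
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        for block in iter(lambda: fobj.read(blocksize), b""):
            md5.update(block)
    return path, md5.hexdigest()

def md5_many(paths, jobs=os.cpu_count()):
    """Hash files in parallel and print one overall progress line."""
    paths = list(paths)
    results = {}
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(file_md5, p) for p in paths]
        for done, fut in enumerate(as_completed(futures), start=1):
            path, digest = fut.result()
            results[path] = digest
            print(f"\rComputing md5: {done}/{len(paths)} files", end="")
    print()
    return results
```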
This is only true for hardlinks and symlinks. You can reflink just a chunk of a file (see `ioctl_ficlonerange`), which allows you to move chunks of files between the cache and the workspace without copying. However, I am not sure that Darwin supports this (likely not).
That is an interesting suggestion. Where does @shcheklein discuss this?
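For reference, a chunk-level reflink from Python on Linux looks roughly like the sketch below. This only works on reflink-capable filesystems such as Btrfs or XFS, offsets generally need to be block-aligned, and `reflink_range` is just an illustrative helper; the constant is the Linux `FICLONERANGE` ioctl number from `<linux/fs.h>`:

```python
import fcntl
import os
import struct

# Linux FICLONERANGE ioctl number; the argument is struct file_clone_range:
# s64 src_fd, u64 src_offset, u64 src_length, u64 dest_offset.
FICLONERANGE = 0x4020940D

def reflink_range(src_path, dst_path, src_offset, length, dst_offset):
    """Share `length` bytes of src with dst without copying data (CoW clone)."""
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        arg = struct.pack("qQQQ", src_fd, src_offset, length, dst_offset)
        fcntl.ioctl(dst_fd, FICLONERANGE, arg)
    finally:
        os.close(src_fd)
        os.close(dst_fd)
```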
The most relevant problem for me is computing hashes for a large number of large files, because that's what takes the longest. Parallelizing over files speeds up the hash computation for a large number of large files, but it offers only a small speed-up for a large number of small files or for a small number of large files.
This actually points to a bigger problem with the cache, local remotes, and HTTP remotes: you may not be able to pull from a local or HTTP remote if it contains a file that is too large for your filesystem (for example, the 4 GB file-size limit of FAT32). We could solve that by chunking the files that we store in the cache, but that also means that we could no longer hardlink and symlink files to the cache. Moreover, it would also make the format of the cache, local remotes, and HTTP remotes backwards-incompatible, so this would be quite a significant change to the fundamental design of dvc!
If we started chunking the files that we store in the cache, and we also started packing small files together, then parallelizing over files would speed up the hash computation for any number of files (both large and small). However, as I said above, this would be a major change.
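To make the chunking idea concrete, here is a minimal sketch (the chunk size and the names `hash_chunk` and `chunked_md5` are hypothetical, and this says nothing about how chunks would actually be addressed in the cache) that hashes one large file as independent chunks, which parallelizes even when there is only a single huge file:

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

CHUNK_SIZE = 64 * 1024 * 1024  # hypothetical fixed chunk size (64 MiB)

def hash_chunk(path, offset, size):
    """MD5 a single chunk of the file, identified by its byte offset."""
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        fobj.seek(offset)
        md5.update(fobj.read(size))
    return offset, md5.hexdigest()

def chunked_md5(path, jobs=4):
    """Return (offset, md5) pairs for every chunk of one large file."""
    offsets = range(0, os.path.getsize(path), CHUNK_SIZE)
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(hash_chunk, path, off, CHUNK_SIZE) for off in offsets]
        return sorted(fut.result() for fut in futures)
```

Each (offset, digest) pair could then in principle be stored as its own cache object, at the cost of the hardlink/symlink and compatibility issues described above.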