Parallelize computing md5 for large files

After executing dvc run, I have been staring at the “Computing md5 for a large file” message for the past five hours. However, only a single CPU and about 20% of the SSD’s maximum read speed are utilized, so a roughly 5× speed-up could be achieved just by computing MD5 in parallel.

We may have to get rid of the per-file progress bars, but I think that a single overall progress bar would be more informative anyway, especially with hundreds of files being added to the cache. What do you think?
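
For illustration, here is a minimal sketch (standard library only, not dvc’s actual code) of parallelizing MD5 over files while reporting a single overall progress line; the file_md5 and md5_many helpers are hypothetical names:

```python
# Sketch: hash many files in parallel, one worker process per CPU,
# with a single overall progress line instead of a bar per file.
import hashlib
import os
import sys
from concurrent.futures import ProcessPoolExecutor, as_completed

def file_md5(path, chunk_size=1 << 20):
    """Stream a file through MD5 in 1 MiB chunks so memory stays flat."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return path, md5.hexdigest()

def md5_many(paths, jobs=None):
    """Compute MD5 for every path in parallel and return {path: digest}."""
    paths = list(paths)
    results = {}
    with ProcessPoolExecutor(max_workers=jobs or os.cpu_count()) as pool:
        futures = [pool.submit(file_md5, p) for p in paths]
        for done, future in enumerate(as_completed(futures), 1):
            path, digest = future.result()
            results[path] = digest
            sys.stderr.write(f"\rComputing md5: {done}/{len(paths)} files")
    sys.stderr.write("\n")
    return results

if __name__ == "__main__":
    for path, digest in md5_many(sys.argv[1:]).items():
        print(digest, path)
```

Worker processes sidestep GIL contention, and each file is streamed in 1 MiB chunks, so the approach scales to very large files without loading them into memory.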

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 6
  • Comments: 23 (23 by maintainers)

Top GitHub Comments

2 reactions
Witiko commented, May 31, 2019

> We have been thinking about splitting files on a smaller scale as per #829 too. One of the problems with that is that you would be doubling your storage, since you’ll need to reconstruct the files back in the cache and then link them to the workspace.

This is only true for hardlinks and symlinks. You can reflink just a chunk of a file (see ioctl_ficlonerange), which allows you to move chunks of files between the cache and the workspace without copying. However, I am not sure that Darwin supports this (likely not).
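
For reference, a minimal, hypothetical sketch of calling ioctl_ficlonerange from Python. It assumes Linux and a reflink-capable filesystem such as Btrfs or XFS; the reflink_range helper and the hard-coded ioctl number are illustrative, not anything dvc ships:

```python
# Share (reflink) a byte range of one file into another without copying data.
# Linux-only; the filesystem must support reflinks (Btrfs, XFS, ...).
import fcntl
import struct

# ioctl request number for FICLONERANGE on Linux:
# _IOW(0x94, 13, struct file_clone_range), where the struct is 32 bytes.
FICLONERANGE = 0x4020940D

def reflink_range(src_path, dst_path, src_offset, length, dst_offset):
    # struct file_clone_range { __s64 src_fd; __u64 src_offset;
    #                           __u64 src_length; __u64 dest_offset; };
    # Offsets and length generally must be aligned to the filesystem block
    # size; a length of 0 clones from src_offset to the end of the source.
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        arg = struct.pack("qQQQ", src.fileno(), src_offset, length, dst_offset)
        fcntl.ioctl(dst.fileno(), FICLONERANGE, arg)
```

This is the capability that would let chunks be shared between a chunked cache and the workspace without doubling storage, but it is limited to filesystems with reflink support.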

2 reactions
Witiko commented, May 31, 2019

> btw as @shcheklein rightfully noted, we could possibly use that s3 approach too to be able to compute md5 of a large file in parallel by splitting it into chunks.

That is an interesting suggestion. Where does @shcheklein discuss this?
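
For illustration, a sketch (not dvc’s implementation) of that chunked, S3-multipart-style approach: hash fixed-size parts of one large file in parallel, then combine the part digests the way S3 builds a multipart ETag (MD5 of the concatenated per-part MD5s plus the part count). The multipart_md5 helper and the 64 MiB part size are arbitrary choices:

```python
# Hash fixed-size chunks of a single large file in parallel, then combine
# the per-chunk digests into one identifier, S3-multipart-ETag style.
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

def _chunk_md5(path, offset, size):
    """MD5 of one chunk; chunks are read independently so they can run in parallel."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.md5(f.read(size)).digest()

def multipart_md5(path, chunk_size=64 * 1024 * 1024, jobs=None):
    total = os.path.getsize(path)
    offsets = range(0, total, chunk_size)
    with ProcessPoolExecutor(max_workers=jobs or os.cpu_count()) as pool:
        digests = list(pool.map(_chunk_md5,
                                [path] * len(offsets),
                                offsets,
                                [chunk_size] * len(offsets)))
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"
```

Note that the combined value is not the same as the whole-file MD5, so adopting it would change what gets stored in the cache and remotes, which is the compatibility concern discussed below.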

> btw @Witiko , which problem is more relevant for you? Small number of large files or big number of small files?

The most relevant problem for me is computing hashes for a large number of large files, because that’s what takes the longest. Parallelizing over files speeds up the hash computation for a large number of large files, but it offers only a small speed-up for a large number of small files and a small number of large files.

This actually points to a bigger problem with the cache, local remotes, and HTTP remotes: you may not be able to pull a local or HTTP remote if it contains a file that is too large for your filesystem (such as the 4 GB file-size limit of FAT32). We could solve that by chunking the files that we store in the cache, but that also means that we could no longer hardlink and symlink files to the cache. Moreover, it would also make the format of the cache, local remotes, and HTTP remotes backwards-incompatible, so this would be quite a significant change to the fundamental design of dvc!

If we started chunking the files that we store in the cache, and we also started packing small files together, then parallelizing over files would speed up the hash computation for any number of files (both large and small). However, as I said above, this would be a major change.
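
To make the chunked-cache idea concrete, here is a hypothetical sketch (not dvc’s actual cache format): split a file into pieces small enough for any filesystem, store each piece under its own MD5, and record an ordered manifest so the original file can be reassembled. The manifest layout and the 1 GiB piece size are invented for illustration:

```python
# Hypothetical chunked cache: content-addressed pieces plus an ordered manifest.
import hashlib
import json
import os

PIECE_SIZE = 1 * 1024 * 1024 * 1024  # 1 GiB, safely under FAT32's 4 GB file limit

def add_chunked(path, cache_dir, piece_size=PIECE_SIZE):
    os.makedirs(cache_dir, exist_ok=True)
    manifest = []
    with open(path, "rb") as f:
        while True:
            # A real implementation would stream each piece instead of
            # holding it in memory; this keeps the sketch short.
            piece = f.read(piece_size)
            if not piece:
                break
            digest = hashlib.md5(piece).hexdigest()
            piece_path = os.path.join(cache_dir, digest)
            if not os.path.exists(piece_path):  # pieces are deduplicated by hash
                with open(piece_path, "wb") as out:
                    out.write(piece)
            manifest.append(digest)
    manifest_path = os.path.join(cache_dir, os.path.basename(path) + ".json")
    with open(manifest_path, "w") as out:
        json.dump({"piece_size": piece_size, "pieces": manifest}, out)
    return manifest_path
```

Pulling such a cache onto FAT32 would then never require writing a single object larger than the piece size, at the cost of the hardlink/symlink and format-compatibility trade-offs described above.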

Top Results From Across the Web

  • Bash: parallelize md5sum checksum on many files
    I am looking for a way to compute 64 md5sums of 64 different files in parallel, as long as there are any files...
  • Parallelizing MD5 Checksum Computation to Speed Up S3 ...
    Learn how to optimize performance by parallelizing MD5 checksum computation with the Go assembler to avoid any data slowdown from Amazon S3...
  • How does one check huge files identity if hashing is CPU ...
    What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's...
  • High Performance Multi-Node File Copies and Checksums for ...
    HP-UX MD5 Secure Checksum [13] is an md5sum utility that uses multi-threading to compute the checksums of multiple files at once. Unlike...
  • linux - How can I verify that a 1TB file transferred correctly?
    Includes a hash-speed table comparing MD5 (5.0 / 5.2 / 13.1), SHA1 (4.7 / 4.8 / 13.7), and SHA256 (12.8 / 13.0 / 30.0) for long messages, 4096 B, and 64 B inputs.
