Parallelize computing md5 for large files
After executing `dvc run`, I have been staring at the "Computing md5 for a large file" message for the past five hours. However, only a single CPU and about 20% of the SSD's maximum read speed are utilized, so a roughly 5× speed-up should be achievable just by computing MD5 in parallel.
We may have to get rid of the per-file progress bars, but I think a single overall progress bar would be more informative anyway, especially with hundreds of files being added to the cache. What do you think?
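For illustration, something along these lines — a minimal sketch, not a proposed implementation (`file_md5` and `md5_many` are hypothetical names) — hashes files in a process pool and reports a single overall progress counter:

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor, as_completed

def file_md5(path, blocksize=1024 * 1024):
    """Stream one file through MD5 in fixed-size blocks to keep memory flat."""
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        for block in iter(lambda: fobj.read(blocksize), b""):
            md5.update(block)
    return path, md5.hexdigest()

def md5_many(paths, jobs=os.cpu_count()):
    """Hash files in parallel and print one overall progress line."""
    paths = list(paths)
    results = {}
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(file_md5, p) for p in paths]
        for done, fut in enumerate(as_completed(futures), start=1):
            path, digest = fut.result()
            results[path] = digest
            print(f"\rComputing md5: {done}/{len(paths)} files", end="")
    print()
    return results
```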
This is only true for hardlinks and symlinks. You can reflink just a chunk of a file (see `ioctl_ficlonerange`), which allows you to move chunks of files between the cache and the workspace without copying. However, I am not sure that Darwin supports this (likely not).
That is an interesting suggestion. Where does @shcheklein discuss this?
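For reference, a chunk-level reflink from Python on Linux looks roughly like the sketch below. This only works on reflink-capable filesystems such as Btrfs or XFS, offsets generally need to be block-aligned, and `reflink_range` is just an illustrative helper; the constant is the Linux `FICLONERANGE` ioctl number from `<linux/fs.h>`:

```python
import fcntl
import os
import struct

# Linux FICLONERANGE ioctl number; the argument is struct file_clone_range:
# s64 src_fd, u64 src_offset, u64 src_length, u64 dest_offset.
FICLONERANGE = 0x4020940D

def reflink_range(src_path, dst_path, src_offset, length, dst_offset):
    """Share `length` bytes of src with dst without copying data (CoW clone)."""
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        arg = struct.pack("qQQQ", src_fd, src_offset, length, dst_offset)
        fcntl.ioctl(dst_fd, FICLONERANGE, arg)
    finally:
        os.close(src_fd)
        os.close(dst_fd)
```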
The most relevant problem for me is computing hashes for a large number of large files, because that's what takes the longest. Parallelizing over files speeds up the hash computation for a large number of large files, but it offers only a small speed-up for a large number of small files or for a small number of large files.
This actually points to a bigger problem with the cache, local remotes, and HTTP remotes: you may not be able to pull from a local or HTTP remote if it contains a file that is too large for your filesystem (for example, the 4 GB file-size limit of FAT32). We could solve that by chunking the files that we store in the cache, but that also means that we could no longer hardlink and symlink files to the cache. Moreover, it would also make the format of the cache, local remotes, and HTTP remotes backwards-incompatible, so this would be quite a significant change to the fundamental design of dvc!
If we started chunking the files that we store in the cache, and we also started packing small files together, then parallelizing over files would speed up the hash computation for any number of files (both large and small). However, as I said above, this would be a major change.
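To make the chunking idea concrete, here is a minimal sketch (the chunk size and the names `hash_chunk` and `chunked_md5` are hypothetical, and this says nothing about how chunks would actually be addressed in the cache) that hashes one large file as independent chunks, which parallelizes even when there is only a single huge file:

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

CHUNK_SIZE = 64 * 1024 * 1024  # hypothetical fixed chunk size (64 MiB)

def hash_chunk(path, offset, size):
    """MD5 a single chunk of the file, identified by its byte offset."""
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        fobj.seek(offset)
        md5.update(fobj.read(size))
    return offset, md5.hexdigest()

def chunked_md5(path, jobs=4):
    """Return (offset, md5) pairs for every chunk of one large file."""
    offsets = range(0, os.path.getsize(path), CHUNK_SIZE)
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(hash_chunk, path, off, CHUNK_SIZE) for off in offsets]
        return sorted(fut.result() for fut in futures)
```

Each (offset, digest) pair could then in principle be stored as its own cache object, at the cost of the hardlink/symlink and compatibility issues described above.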