investigate possible checkout performance regression since 0.20.0
tdeboissiere on Discord reports that dvc pull (which didn’t download anything, so dvc checkout is the culprit) takes 111s on 0.20.0 but 160s on 0.20.3. We need to investigate whether we have a regression in checkout performance.
Issue Analytics
- State:
- Created 5 years ago
- Comments: 8 (5 by maintainers)
Thank you guys for your patience. With add/checkout, it turned out that we are not short-circuiting log messages early enough in non-verbose modes, causing such delays (up to 50% speedup on a test with a directory of 100K files). This was the cause of the regression after 0.20.0, since we’ve added more debug messages after it. Will release a new version ASAP. #1331 is still relevant, since we could improve the performance much more. Also, the gc issue described by @IamGianluca is a separate one, so I created https://github.com/iterative/dvc/issues/1429 to track that. Actively working on optimizations right now.

@IamGianluca Whoa, that is a lot of time 🙁 Note that if you remove .dvc/cache, you will lose the cache for your pipeline inputs as well, so you won’t be able to reproduce the pipeline as is; you would have to manually place the input data back into your workspace. If you know where to find that data, you could indeed rm -rf .dvc/cache for now. Otherwise you need to at least back up that input data somewhere, either by manually copying it, or by creating a directory, making it a dvc remote, and pushing data there (either dvc push to back up all currently used cache, or dvc push data.dvc for all input data).

I’m investigating right now. Thank you for your patience guys.
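
For readers unfamiliar with the log short-circuiting mentioned in the comments above, here is a minimal sketch of the general idea using Python's standard logging module. It is not DVC's actual code: the logger name and the functions describe, checkout_eager, and checkout_guarded are hypothetical and exist only for illustration. The point is that building a debug message for every file costs time even when debug output is discarded, so checking the log level first avoids that work in non-verbose mode.

```python
import logging

# Hypothetical logger; INFO level stands in for a non-verbose mode
# where DEBUG messages are discarded.
logger = logging.getLogger("checkout_sketch")
logger.setLevel(logging.INFO)
logger.addHandler(logging.NullHandler())

calls = {"count": 0}


def describe(path):
    """Hypothetical stand-in for an expensive message-building step."""
    calls["count"] += 1
    return "checking out '{}'".format(path)


def checkout_eager(paths):
    # The message is built for every file, even though the logger
    # drops it immediately because DEBUG is disabled.
    for path in paths:
        logger.debug(describe(path))


def checkout_guarded(paths):
    # Check the level once and skip message building when DEBUG is off.
    debug_enabled = logger.isEnabledFor(logging.DEBUG)
    for path in paths:
        if debug_enabled:
            logger.debug(describe(path))


paths = ["file{}".format(i) for i in range(1000)]

checkout_eager(paths)
eager_calls = calls["count"]

calls["count"] = 0
checkout_guarded(paths)
guarded_calls = calls["count"]
```

In this sketch the eager variant builds a message for every path, while the guarded variant builds none when debug logging is off. With a directory of 100K files, that kind of per-file overhead can add up, which matches the behaviour described in the thread.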