Large memory consumption and slow hashing on `dvc status` after `dvc import` of parent of dvc-tracked folder
Bug Report
status: slow performance and high memory consumption after importing parent of a dvc-tracked folder
Description
I have a DVC repository with the following structure:
parent/subfolder/[many files]
The subfolder is tracked with DVC.
In a different repo, I want to import the parent folder. After importing it, running dvc status computes hashes very slowly (around 13 files/s, while the original dvc add managed more than 800 files/s). Moreover, memory consumption grows steadily over time and exceeds 8 GB for around 10k files (each containing only a few characters).
Reproduce
# Create first dvc repo from which to import later
mkdir dvc-source
cd dvc-source
git init
dvc init
git commit -m "empty dataset"
# create files
mkdir -p parent/subfolder
for n in $(seq 10000); do echo $n > parent/subfolder/$n.txt; done
dvc add parent/subfolder # ~800 md5/s
git add parent/.gitignore parent/subfolder.dvc
git commit -m "add dataset"
# Create importing repo
cd ..
mkdir importing-repo
cd importing-repo
git init
dvc init
git commit -m "empty dataset"
dvc import ../dvc-source/ parent
dvc status # slow <20 md5/s and memory allocation goes up
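To verify the numbers above, the status run can be wrapped in GNU time (a suggested measurement, not part of the original script); its verbose output includes the peak resident memory:
/usr/bin/time -v dvc status   # "Maximum resident set size" shows the multi-GB peak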
Expected
If I dvc import parent/subfolder directly, then dvc status is fast and memory consumption stays low. I would expect the same behavior after importing parent.
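For comparison, the fast path from the same source repo looks like this:
# in a fresh importing repo
dvc import ../dvc-source/ parent/subfolder
dvc status   # fast, low memory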
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.7.3 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.11.0-34-generic-x86_64-with-glibc2.29
Supports:
azure (adlfs = 2021.9.1, knack = 0.8.2, azure-identity = 1.6.1),
gdrive (pydrive2 = 1.9.3),
gs (gcsfs = 2021.8.1),
hdfs (pyarrow = 5.0.0),
webhdfs (hdfs = 2.6.0),
http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
s3 (s3fs = 2021.8.1, boto3 = 1.17.106),
ssh (sshfs = 2021.8.1),
oss (ossfs = 2021.8.0),
webdav (webdav4 = 0.9.1),
webdavs (webdav4 = 0.9.1)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git

Thanks for your report, and especially for the detailed reproduction script. I tried it on my machine: DVC creates an external repo for every single file, which I guess is why it is so slow.
It seems the clone of the source repo is cached on disk, but the cached repo is instantiated and queried again for each file. Why is this necessary? Essentially the repo stores only a single .dvc file. Couldn't this be kept in memory?
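To illustrate the suggestion, here is a minimal sketch (assuming a hypothetical open_source_repo helper as a stand-in for whatever DVC uses internally to open the imported repo; this is not DVC's actual API): memoizing on (url, rev) would let all 10k per-file lookups share one in-memory repo object instead of re-opening the cached clone each time.

import functools

@functools.lru_cache(maxsize=None)   # memoize on (url, rev)
def open_source_repo(url, rev):
    # Hypothetical stand-in for DVC's internal "open the imported repo" step.
    print(f"opening clone of {url} at rev {rev}")  # printed once, not 10,000 times
    return object()                                # placeholder for a repo handle

def hash_imported_file(url, rev, path):
    repo = open_source_repo(url, rev)  # cache hit for every file after the first
    # ... locate `path` inside `repo` and hash it (omitted) ...
    return path

for n in range(3):
    hash_imported_file("../dvc-source/", "HEAD", f"parent/subfolder/{n}.txt")
# "opening clone of ..." appears only once; the same repo object serves all files.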