Large memory consumption and slow hashing on `dvc status` after `dvc import` of parent of dvc-tracked folder
Bug Report
status: slow performance and high memory consumption after importing parent of a dvc-tracked folder
Description
I have a DVC repository with the following structure:
parent/subfolder/[many files]
The subfolder is tracked with DVC.
In a different repo, I want to import the parent folder. After importing it, running dvc status computes hashes very slowly (around 13 files/s, while the original dvc add managed more than 800 files/s). Moreover, memory consumption grows steadily over time and exceeds 8 GB for around 10k files (each containing only a few characters).
Reproduce
# Create first dvc repo from which to import later
mkdir dvc-source
cd dvc-source
git init
dvc init
git commit -m "empty dataset"
# create files
mkdir -p parent/subfolder
for n in $(seq 10000); do echo $n > parent/subfolder/$n.txt; done
dvc add parent/subfolder # ~800 md5/s
git add parent/.gitignore parent/subfolder.dvc
git commit -m "add dataset"
# Create importing repo
cd ..
mkdir importing-repo
cd importing-repo
git init
dvc init
git commit -m "empty dataset"
dvc import ../dvc-source/ parent
dvc status # slow <20 md5/s and memory allocation goes up
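To verify the numbers above, the status run can be wrapped in GNU time (a suggested measurement, not part of the original script); its verbose output includes the peak resident memory:
/usr/bin/time -v dvc status   # "Maximum resident set size" shows the multi-GB peak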
Expected
If I dvc import parent/subfolder directly, then dvc status is fast and memory consumption stays low. I would expect the same behavior after importing parent.
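For comparison, the fast path from the same source repo looks like this:
# in a fresh importing repo
dvc import ../dvc-source/ parent/subfolder
dvc status   # fast, low memory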
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.7.3 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.11.0-34-generic-x86_64-with-glibc2.29
Supports:
azure (adlfs = 2021.9.1, knack = 0.8.2, azure-identity = 1.6.1),
gdrive (pydrive2 = 1.9.3),
gs (gcsfs = 2021.8.1),
hdfs (pyarrow = 5.0.0),
webhdfs (hdfs = 2.6.0),
http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
s3 (s3fs = 2021.8.1, boto3 = 1.17.106),
ssh (sshfs = 2021.8.1),
oss (ossfs = 2021.8.0),
webdav (webdav4 = 0.9.1),
webdavs (webdav4 = 0.9.1)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git

Thanks for your report, and especially for the detailed reproduction script. I tried it on my machine: DVC creates an external repo for every single file, which I guess is why it is so slow.
It seems the clone of the source repo is cached on disk, but the cached repo is instantiated and queried again for each file. Why is this necessary? Essentially the repo stores only a single .dvc file. Couldn't this be kept in memory?
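To illustrate the suggestion, here is a minimal sketch (assuming a hypothetical open_source_repo helper as a stand-in for whatever DVC uses internally to open the imported repo; this is not DVC's actual API): memoizing on (url, rev) would let all 10k per-file lookups share one in-memory repo object instead of re-opening the cached clone each time.

import functools

@functools.lru_cache(maxsize=None)   # memoize on (url, rev)
def open_source_repo(url, rev):
    # Hypothetical stand-in for DVC's internal "open the imported repo" step.
    print(f"opening clone of {url} at rev {rev}")  # printed once, not 10,000 times
    return object()                                # placeholder for a repo handle

def hash_imported_file(url, rev, path):
    repo = open_source_repo(url, rev)  # cache hit for every file after the first
    # ... locate `path` inside `repo` and hash it (omitted) ...
    return path

for n in range(3):
    hash_imported_file("../dvc-source/", "HEAD", f"parent/subfolder/{n}.txt")
# "opening clone of ..." appears only once; the same repo object serves all files.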