Large memory consumption and slow hashing on `dvc status` after `dvc import` of parent of dvc-tracked folder

Bug Report

status: slow performance and high memory consumption after importing parent of a dvc-tracked folder

Description

I have a dvc repository with the following structure

parent/subfolder/[many files]

I track the subfolder.

In a different repo, I want to import the parent folder. After importing it, running dvc status computes hashes very slowly (around 13 md5/s, whereas the original dvc add ran at more than 800 md5/s). Memory consumption also grows steadily over time and exceeds 8 GB for around 10k files, each containing only a few characters.

Reproduce

# Create first dvc repo from which to import later
mkdir dvc-source
cd dvc-source
git init
dvc init
git commit -m "empty dataset"
# create files
mkdir -p parent/subfolder
for n in $(seq 10000); do echo "$n" > "parent/subfolder/$n.txt"; done
dvc add parent/subfolder  #  ~800 md5/s
git add parent/.gitignore parent/subfolder.dvc
git commit -m "add dataset"

# Create importing repo
cd ..
mkdir importing-repo
cd importing-repo
git init
dvc init
git commit -m "empty dataset"
dvc import ../dvc-source/ parent
dvc status  # slow <20 md5/s and memory allocation goes up

Expected

If I dvc import parent/subfolder, then dvc status is fast and memory consumption is low. I would expect this behavior also after importing parent.
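
To compare the two imports with numbers rather than impressions, one option is a small wrapper that times a command and reads the peak resident memory of its child process. This is a minimal sketch, not part of the original report: the `measure` helper is hypothetical, it relies on the Unix-only `resource` module, and `ru_maxrss` is reported in KiB on Linux (bytes on macOS).

```python
import resource
import subprocess
import sys
import time


def measure(cmd, cwd=None):
    """Run cmd and return (elapsed_seconds, peak_rss, returncode).

    peak_rss is the largest peak among all child processes this Python
    process has waited for so far, so use a fresh Python process for
    each measurement to get a clean number.
    """
    start = time.perf_counter()
    proc = subprocess.run(
        cmd,
        cwd=cwd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss, proc.returncode


# Harmless demo command; swap in ["dvc", "status"] with cwd="importing-repo"
# (assumed directory name from the reproduce script) to measure the real case.
elapsed, peak_rss, code = measure([sys.executable, "-c", "pass"])
print(f"exit={code} elapsed={elapsed:.2f}s peak_rss={peak_rss}")
```

Running it once after importing parent and once after importing parent/subfolder should make the difference described above directly comparable.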

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.7.3 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.11.0-34-generic-x86_64-with-glibc2.29
Supports:
	azure (adlfs = 2021.9.1, knack = 0.8.2, azure-identity = 1.6.1),
	gdrive (pydrive2 = 1.9.3),
	gs (gcsfs = 2021.8.1),
	hdfs (pyarrow = 5.0.0),
	webhdfs (hdfs = 2.6.0),
	http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
	https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
	s3 (s3fs = 2021.8.1, boto3 = 1.17.106),
	ssh (sshfs = 2021.8.1),
	oss (ossfs = 2021.8.0),
	webdav (webdav4 = 0.9.1),
	webdavs (webdav4 = 0.9.1)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git

Additional Information (if any):

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

karajan1001 commented, Sep 17, 2021 (3 reactions)

Thanks for your report, and especially for the detailed reproduction script. I tried it on my computer: it creates an external repo for every single file. I guess that is why it is so slow.

2021-09-17 22:10:38,983 DEBUG: Creating external repo ../dvc-source/@0c305a625d9eaef3d7b3cd3c4b70cf34b87da1f6
2021-09-17 22:10:38,993 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,028 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,028 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,028 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,029 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,029 DEBUG: Creating external repo ../dvc-source/@0c305a625d9eaef3d7b3cd3c4b70cf34b87da1f6
2021-09-17 22:10:39,042 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,079 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,080 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,080 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,080 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
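
To check how often this happens on a full run, the verbose output can be saved to a file (for example from `dvc status -v`) and the "Creating external repo" lines counted. This is a rough sketch that assumes the log format shown above; the `count_external_repo_creations` helper is hypothetical and not part of DVC.

```python
import re

# Matches the DEBUG line shown in the log excerpt and captures "<url>@<rev>".
CREATE_RE = re.compile(r"DEBUG: Creating external repo (\S+)")


def count_external_repo_creations(log_text):
    """Return a dict mapping '<url>@<rev>' to how many times it was created."""
    counts = {}
    for line in log_text.splitlines():
        match = CREATE_RE.search(line)
        if match:
            key = match.group(1)
            counts[key] = counts.get(key, 0) + 1
    return counts


# Three lines taken from the excerpt above.
sample = """\
2021-09-17 22:10:38,983 DEBUG: Creating external repo ../dvc-source/@0c305a625d9eaef3d7b3cd3c4b70cf34b87da1f6
2021-09-17 22:10:38,993 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,029 DEBUG: Creating external repo ../dvc-source/@0c305a625d9eaef3d7b3cd3c4b70cf34b87da1f6
"""
counts = count_external_repo_creations(sample)
print(counts)
```

If the count for a single revision is close to the number of imported files, that matches the one-repo-per-file behavior seen in the log.
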
weidenka commented, Apr 20, 2022 (2 reactions)

It seems the data registry repo clone is cached, but the cached repo is queried for each file. Why is this necessary? Essentially the repo stores only a single .dvc file. Couldn’t this be kept in memory?
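
The "keep it in memory" idea can be illustrated with plain memoization. This is only a sketch of the general technique, not DVC's actual code or a proposed patch: `FakeExternalRepo`, `open_external_repo`, and `status` are hypothetical names, and the lookup returns a placeholder string rather than a real hash.

```python
from functools import lru_cache

open_calls = 0  # counts how often the expensive "open" step really runs


class FakeExternalRepo:
    """Hypothetical stand-in for an opened external repo; not a DVC class."""

    def __init__(self, url, rev):
        self.url = url
        self.rev = rev

    def lookup(self, path):
        # Placeholder result; a real implementation would return file metadata.
        return f"{self.rev[:7]}:{path}"


@lru_cache(maxsize=None)
def open_external_repo(url, rev):
    """Open the repo once per (url, rev) and reuse it for later calls."""
    global open_calls
    open_calls += 1
    return FakeExternalRepo(url, rev)


def status(url, rev, paths):
    # Every path asks for the repo, but only the first call constructs it.
    return {path: open_external_repo(url, rev).lookup(path) for path in paths}


paths = [f"parent/subfolder/{n}.txt" for n in range(1, 10001)]
result = status("../dvc-source/", "0c305a625d9eaef3d7b3cd3c4b70cf34b87da1f6", paths)
print(open_calls, len(result))
```

Keying the cache on the pair (url, rev) keeps different revisions of the same source repo separate while still sharing one instance across all files of a single import.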

