Support reflinks for local remotes

DVC uses reflinks for the cache, but it does not seem to use reflinks for local remotes. In the following example, I will create a large file, add it to DVC, and push it to a local remote. Disk space is consumed twice: when the file is created and when it is pushed to a local remote. When I remove the content of the local remote and manually reflink the content of the cache to the local remote, the disk space is reclaimed:

$ dvc --version
0.40.2
$ mkdir test remote_cache
$ cd test
$ dvc init --no-scm
$ df -h
/dev/sdb6                               932G  756G  172G  82% /home
$ dd if=/dev/zero of=file.txt count=$((10*1024)) bs=1048576
$ ls -lh file.txt
-rw-r--r-- 1 witiko witiko 10G May 27 19:03 file.txt
$ df -h
/dev/sdb6                               932G  766G  162G  83% /home
$ dvc add file.txt
$ df -h
/dev/sdb6                               932G  766G  162G  83% /home
$ dvc remote add -d local ../remote_cache
$ dvc push
$ df -h
/dev/sdb6                               932G  776G  152G  84% /home
$ rm -rf ../remote_cache/*
$ cp -r --reflink=always .dvc/cache/* ../remote_cache
$ df -h
/dev/sdb6                               932G  766G  162G  83% /home
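
The manual `cp --reflink=always` step above only succeeds on filesystems that support reflinks, and the cache and the local remote must also live on the same filesystem. A minimal sketch for probing a directory before relying on that step is shown below. The helper name `supports_reflink` is hypothetical and the sketch assumes GNU cp, whose `--reflink=always` mode fails instead of silently making a full copy.

```shell
# Hypothetical helper: returns 0 if a reflink copy succeeds inside directory $1,
# 1 if it does not, and 2 if the probe file cannot be created.
# Assumes GNU cp; --reflink=always fails rather than falling back to a full copy.
supports_reflink() {
    probe_dir=$1
    probe_src=$(mktemp "$probe_dir/reflink-probe.XXXXXX") || return 2
    probe_dst="$probe_src.clone"
    printf 'probe' > "$probe_src"
    if cp --reflink=always "$probe_src" "$probe_dst" 2>/dev/null; then
        reflink_rc=0
    else
        reflink_rc=1
    fi
    # Clean up both probe files regardless of the outcome.
    rm -f "$probe_src" "$probe_dst"
    return "$reflink_rc"
}

# Usage (path taken from the example above):
#   supports_reflink ../remote_cache && echo "reflinks available"
```

A successful probe only shows that reflinks work within that directory; it does not prove that the cache and the remote share a filesystem.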

I can circumvent the issue by setting cache.dir to ../../remote_cache, but that will affect users who download Git repositories with my experiments. Therefore, my preferred workaround is to symlink .dvc/cache to ../../remote_cache:

$ cd .dvc
$ rm -rf cache
$ ln -s ../../remote_cache cache

However, this workaround does not work when you have several local remotes, in which case you would need to symlink in the other direction (from the local remotes to .dvc/cache). The workaround also makes it impossible to defer publishing the latest changes in your cache.
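
For the case with several local remotes, the reverse symlinking mentioned above could be sketched roughly as follows. The helper name `link_remotes_to_cache` is hypothetical, the cache path is assumed to be absolute, and the helper only touches remotes that are missing, empty, or already symlinks. It does not lift the limitation that everything in the cache becomes visible in the remotes immediately.

```shell
# Hypothetical helper: make each remote directory a symlink to the cache.
# $1 is the absolute path of the cache; the remaining arguments are remote paths.
link_remotes_to_cache() {
    cache_dir=$1
    shift
    for remote in "$@"; do
        # A real, non-empty directory may hold data; refuse to replace it.
        if [ -e "$remote" ] && [ ! -L "$remote" ]; then
            rmdir "$remote" 2>/dev/null || {
                echo "skipping non-empty remote: $remote" >&2
                continue
            }
        fi
        # -n replaces an existing symlink instead of descending into it.
        ln -sfn "$cache_dir" "$remote"
    done
}

# Usage (remote names are placeholders):
#   link_remotes_to_cache "$PWD/.dvc/cache" ../remote_a ../remote_b
```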

In conclusion, it would be useful if the local remotes supported reflinks, like the cache does.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 14 (13 by maintainers)

Top GitHub Comments

3 reactions
michal-ruzicka commented, May 28, 2019

@efiop Great! I use Btrfs whenever possible, so reflinking works for me and is extremely convenient. Thanks @Witiko!

3 reactions
efiop commented, May 27, 2019

We might want to discuss whether we should try to symlink and hardlink to the local remote when the filesystem does not support reflinks. I think we should not, because even protected symlinks and hardlinks can be easily deleted and modified, which creates a dependency between the remote and the project. If we did, we might want to make this behavior configurable.

@Witiko I agree 100% with that. We should not use hardlinks or symlinks there, because they don’t protect the link from corruption. We should only use a reflink if it is available and fall back to a copy if it is not. That could be done by default, without any config options to enable or disable it.
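
The policy described here (reflink when available, plain copy otherwise, never a hardlink or symlink) can be illustrated with a small shell sketch. This is not DVC's implementation; the helper name `reflink_or_copy` is hypothetical and the sketch assumes GNU cp.

```shell
# Hypothetical helper, not DVC code: place the content of $1 at $2,
# sharing disk blocks when the filesystem allows it.
reflink_or_copy() {
    src=$1
    dst=$2
    mkdir -p "$(dirname "$dst")"
    # Try a reflink first; --reflink=always fails instead of copying
    # when reflinks are unavailable.
    if cp --reflink=always "$src" "$dst" 2>/dev/null; then
        echo "reflinked $dst"
    else
        # Plain copy; this also overwrites anything the failed attempt left behind.
        cp "$src" "$dst"
        echo "copied $dst"
    fi
}

# Usage (paths are placeholders):
#   reflink_or_copy .dvc/cache/<object> ../remote_cache/<object>
```

With GNU cp, `cp --reflink=auto` performs the same fallback in a single call; the two-step form is used here only to make the decision visible. In both outcomes the destination is an independent file, so modifying or deleting it does not affect the source.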
