Support reflinks for local remotes
DVC uses reflinks for the cache, but it does not seem to use reflinks for local remotes. In the following example, I create a large file, add it to DVC, and push it to a local remote. Disk space is consumed twice: once when the file is created and again when it is pushed to the local remote. When I remove the content of the local remote and manually reflink the content of the cache to the local remote, the disk space is reclaimed:
$ dvc --version
0.40.2
$ mkdir test remote_cache
$ cd test
$ dvc init --no-scm
$ df -h
/dev/sdb6 932G 756G 172G 82% /home
$ dd if=/dev/zero of=file.txt count=$((10*1024)) bs=1048576
$ ls -lh file.txt
-rw-r--r-- 1 witiko witiko 10G May 27 19:03 file.txt
$ df -h
/dev/sdb6 932G 766G 162G 83% /home
$ df -h
/dev/sdb6 932G 766G 162G 83% /home
$ dvc add file.txt
$ df -h
/dev/sdb6 932G 766G 162G 83% /home
$ df -h
/dev/sdb6 932G 766G 162G 83% /home
$ dvc remote add -d local ../remote_cache
$ dvc push
$ df -h
/dev/sdb6 932G 776G 152G 84% /home
$ df -h
/dev/sdb6 932G 776G 152G 84% /home
$ rm -rf ../remote_cache/*
$ cp -r --reflink=always .dvc/cache/* ../remote_cache
$ df -h
/dev/sdb6 932G 766G 162G 83% /home
I can circumvent the issue by setting cache.dir to ../../remote_cache, but that will affect users who download Git repositories with my experiments. Therefore, my preferred workaround is to symlink .dvc/cache to ../../remote_cache:
$ cd .dvc
$ rm -rf cache
$ ln -s ../../remote_cache cache
However, this workaround does not work when you have several local remotes, in which case you would need to symlink in the other direction (from the local remotes to .dvc/cache). The workaround also makes it impossible to defer publishing the latest changes in your cache.
In conclusion, it would be useful if the local remotes supported reflinks, like the cache does.
@efiop Great! I use Btrfs whenever possible, so reflinking works for me and is extremely convenient. Thanks @Witiko!
@Witiko I agree 100% with that. We should not use hardlink/symlink there, because they don’t protect the link from corruption. We should only use reflink if it is available and fall back to copy if it is not. That could be done by default, without any config options to enable or disable it.
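For illustration, the "reflink if available, otherwise copy" behaviour proposed above could look roughly like this in shell. This is only a sketch and not how DVC implements it: copy_reflink_or_fallback is a hypothetical helper name, and it assumes a cp that understands --reflink=always (GNU coreutils) on a filesystem that supports reflinks, such as Btrfs. If the reflink attempt fails for any reason, it falls back to a plain copy.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): copy SRC to DST, preferring a reflink.
# Prints "reflink" or "copy" to report which method succeeded.
copy_reflink_or_fallback() {
    src=$1
    dst=$2
    # First try a copy-on-write clone; this fails if cp or the filesystem
    # does not support reflinks.
    if cp --reflink=always -- "$src" "$dst" 2>/dev/null; then
        echo reflink
    else
        # Fall back to a regular copy, overwriting any partial destination file.
        cp -- "$src" "$dst" && echo copy
    fi
}
```

Hard links and symlinks are deliberately not tried here, matching the point above that they do not protect the remote from corruption.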