output/dependency/remote: slow performance for remote directories
The problem: As discussed on Discord (see from this message and the next 14 messages), I have tried out the new feature (#1654) of using remote folders as DVC stage dependencies and outputs; however, I found the runtime extremely slow.
My test example: I have created the following code example to test out the reason for the slow runtime. The code contains:

- `rsync_dataset.sh`: a local bash script which performs rsync of a remote folder over ssh.
- `rsync_dataset.dvc`: the DVC stage for executing `rsync_dataset.sh`, which has the remote folder as a stage output. Note that the remote output folder address "remote://ahsoka_project_data/PS_141_test_output_data_folder" is expanded by my DVC configuration to the SSH remote "ssh://fogh@ahsoka.vfltest.dk:22/scratch/dvc_project_cache/PS/"; this URI is also seen in the log file.
- `rsync_dataset.log`: the verbose log from running `dvc repro rsync_dataset.dvc -vf > rsync_dataset.log`.

(EDIT - I totally forgot to attach the .py, .dvc and .log file: rsync_dataset.log)
`rsync_dataset.sh`:

```bash
#!/bin/bash
#
# A script for copying the BDICG data in the
# "raster_data_cloudless_2018_12_06_10_15_29" folder containing the netCDF
# dataset per field to the PS data directory.
#
# Run the script locally. It connects to Ahsoka and initiates the copy.

echo '"rsync_data_set.sh" began...'

AHSOKA=ahsoka.vfltest.dk
BDICG_DATADIR=/scratch/dvc_users/fogh/PS/PS_141_test_dependency_data_folder/
PS_DIR=/scratch/dvc_users/fogh/PS/PS_141_test_output_data_folder/

ssh fogh@${AHSOKA} rsync -avh --stats --info=progress2 ${BDICG_DATADIR} ${PS_DIR}

echo '"rsync_data_set.sh" finished.'
```
`rsync_dataset.dvc`:

```yaml
cmd: bash rsync_dataset.sh
deps:
- md5: 6d1ed0801f6d6d065b99195771b1cb92
  path: rsync_dataset.sh
outs:
- cache: true
  md5: c0049518603c8f0154509c54b4630238.dir
  metric: false
  path: remote://ahsoka_project_data/PS_141_test_output_data_folder
  persist: false
wdir: .
md5: 1cfcba0bc0eed0e8e5b6fd1122cb5cc8
```
To answer @mroutis (as you asked in this message): my DVC cache is remote and configured as described here: https://github.com/PeterFogh/dvc_dask_use_case/blob/master/README.md. Regarding file size: on the remote server, the data folder "/scratch/dvc_users/fogh/PS/PS_141_test_dependency_data_folder/" contains 10 netCDF files:
```
$ ls -lh
total 71M
-rw-r--r-- 1 fogh hpcusers 2.8M May 15 13:03 0.nc
-rw-r--r-- 1 fogh hpcusers 6.5M May 15 13:03 1.nc
-rw-r--r-- 1 fogh hpcusers 4.7M May 15 13:03 2.nc
-rw-r--r-- 1 fogh hpcusers 1.4M May 15 13:03 3.nc
-rw-r--r-- 1 fogh hpcusers 3.0M May 15 13:03 4.nc
-rw-r--r-- 1 fogh hpcusers 3.1M May 15 13:03 5.nc
-rw-r--r-- 1 fogh hpcusers 3.5M May 15 13:03 6.nc
-rw-r--r-- 1 fogh hpcusers 2.8M May 15 13:03 7.nc
-rw-r--r-- 1 fogh hpcusers 2.8M May 15 13:03 8.nc
-rw-r--r-- 1 fogh hpcusers 3.0M May 15 13:03 9.nc
```
However, this is only a toy example, as our actual pipeline has a folder containing approx. 300 netCDF files, but each file is still of a similar size (approx. 5 MB).
My DVC is installed using pip, and my DVC version is:

```
> conda deactivate && conda activate py37_v3 && dvc version
DVC version: 0.40.2
Python version: 3.7.3
Platform: Linux-4.4.0-43-Microsoft-x86_64-with-debian-stretch-sid
```
Example runtime: The runtime of `dvc repro rsync_dataset.dvc -vf` is:

```
$ time dvc repro rsync_dataset.dvc -v > rsync_dataset.log

real    2m30.600s
user    0m5.750s
sys     0m4.891s
```
and, as seen in `rsync_dataset.log`, the runtime of `rsync_dataset.sh` itself is less than 1 second. But as I mentioned on Discord, the runtime of our actual pipeline is 39 minutes for a folder of 1.8 GB with 287 files.
I have also computed the md5 checksum of all the files in the remote - it has a runtime of less than 1 second:

```
$ time find PS_141_test_dependency_data_folder/ -type f -exec md5sum {} \;
74ff8e3c0bd44f6487840df0965ed5c3 PS_141_test_dependency_data_folder/7.nc
608572e058c9753026392bbfead38a95 PS_141_test_dependency_data_folder/0.nc
1f412bd9a4be8b5886aa2ec24b53ef48 PS_141_test_dependency_data_folder/9.nc
dd336099ddbd0fab05301719008a210b PS_141_test_dependency_data_folder/3.nc
5fa4a3f6665a918a4af5d8699d151956 PS_141_test_dependency_data_folder/8.nc
f9131d06bb09240e4dd2437735de506c PS_141_test_dependency_data_folder/5.nc
2deaef6a7c55f2ce10d94f3593373001 PS_141_test_dependency_data_folder/6.nc
6be099278fde604c716eae23e7f3b70a PS_141_test_dependency_data_folder/4.nc
51eac88b1630568765bd6e33b6e72ab7 PS_141_test_dependency_data_folder/2.nc

real    0m0.180s
user    0m0.163s
sys     0m0.017s
```
Suspicions for the slow runtime: I suspect the slow runtime is because DVC performs md5 checksum checks 3 times for each file in the remote folder:
1. Check if the file exists in the cache before executing the stage script - an example from the log is:

   ```
   DEBUG: cache 'ssh://fogh@ahsoka.vfltest.dk:22/scratch/dvc_project_cache/PS/60/8572e058c9753026392bbfead38a95' expected '608572e058c9753026392bbfead38a95' actual '608572e058c9753026392bbfead38a95'
   ```

2. Check if the file exists in the cache after executing the stage script - an example from the log is:

   ```
   DEBUG: cache 'ssh://fogh@ahsoka.vfltest.dk:22/scratch/dvc_project_cache/PS/60/8572e058c9753026392bbfead38a95' expected '608572e058c9753026392bbfead38a95' actual '608572e058c9753026392bbfead38a95'
   ```

   Note that the logged line is the same both before and after executing the stage script.

3. Check if the file has changed - an example from the log is:

   ```
   DEBUG: checking if 'remote://ahsoka_project_data/PS_141_test_output_data_folder/0.nc'('{'md5': '608572e058c9753026392bbfead38a95'}') has changed.
   ```
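One way to avoid paying for the same checksum three times is a state cache keyed by cheap file metadata, similar in spirit to what DVC's local state database does. A minimal sketch (hypothetical code, not DVC's actual implementation; the class and field names are my own):

```python
import hashlib
import os


class ChecksumCache:
    """Remember each file's md5 keyed by (inode, mtime, size), so repeated
    'has it changed?' checks avoid re-reading the file contents."""

    def __init__(self):
        self._entries = {}  # path -> ((inode, mtime_ns, size), md5 hex digest)
        self.computes = 0   # counts how many real md5 computations happened

    def md5(self, path):
        st = os.stat(path)
        key = (st.st_ino, st.st_mtime_ns, st.st_size)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == key:
            return cached[1]  # metadata unchanged: trust the cached digest
        self.computes += 1
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        self._entries[path] = (key, digest)
        return digest
```

With a cache like this, the before-run check, after-run check, and changed-check would all resolve from one actual read per file.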
I also suspect the slow runtime is due to the many SSH connections created, one per file checksum.
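Those per-file round trips could in principle be collapsed into a single remote command that checksums every file at once, so the connection cost is paid once rather than once per file. A sketch of the batching idea, assuming `md5sum` is available on the remote; the `remote` parameter is hypothetical, and passing `remote=None` runs locally, which is enough to illustrate the point:

```python
import subprocess


def batched_md5(paths, remote=None):
    """Checksum many files with ONE command invocation instead of one
    process (or SSH session) per file.

    remote: optional 'user@host' string; when given, the whole md5sum
    command runs over a single SSH connection. None runs it locally.
    Returns {path: md5 hex digest}.
    """
    cmd = ["md5sum"] + list(paths)
    if remote is not None:
        cmd = ["ssh", remote, " ".join(cmd)]  # one handshake for all files
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # md5sum prints "<digest>  <path>" per line
    return {line.split(maxsplit=1)[1]: line.split()[0]
            for line in out.splitlines()}
```

Compared with looping `ssh host md5sum <file>` per file, this pays the SSH handshake and authentication latency once, which matters far more than the hashing itself for many small files.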
Question: Is it possible to optimize the checksum checks of remote folders to improve the runtime?
Issue Analytics

- Created 4 years ago
- Comments: 9 (9 by maintainers)
From private discussion: it is worth noting that the state database has a uniqueness constraint on inode, so there is a possible case where we have a state database with mixed local/ssh entries and we start overriding one with another. It would be desirable to handle this potential problem in this task.
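The overriding behaviour described above can be illustrated with a toy schema (hypothetical, not DVC's actual state database): if rows are keyed by inode alone, a local entry and an ssh entry that happen to share an inode number silently replace each other, since inode numbers are only unique within a single filesystem:

```python
import sqlite3

# Toy schema (hypothetical): uniqueness constraint on inode only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE state (inode INTEGER PRIMARY KEY, md5 TEXT)")

# A local file and a remote (ssh) file can legitimately carry the same
# inode number, because they live on different filesystems.
db.execute("INSERT OR REPLACE INTO state VALUES (12345, 'md5-of-local-file')")
db.execute("INSERT OR REPLACE INTO state VALUES (12345, 'md5-of-ssh-file')")

rows = db.execute("SELECT inode, md5 FROM state").fetchall()
# Only one row survives: the local entry has been overridden by the ssh one.
```

Scoping the key by scheme or host as well as inode would keep the two namespaces apart.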
@efiop After trying the newer versions of DVC, I’m happy with the speed. It can cache 5 GB in seconds. Now, I just need the cache to work in my pipeline, see https://github.com/iterative/dvc/issues/2542. Thus, I propose we close this issue 😃