output/dependency/remote: slow performance for remote directories

Hi @efiop and @mroutis,

The problem: As discussed on Discord (see this message and the 14 messages that follow), I tried out the new feature (#1654) of using remote folders as DVC stage dependencies and outputs. However, I found the runtime extremely slow.

My test example: I created the following code example to investigate the cause of the slow runtime. It contains:

  • rsync_dataset.sh: a local bash script which performs rsync of a remote folder over ssh
  • rsync_dataset.dvc: the DVC stage for executing rsync_dataset.sh, which has the remote folder as a stage output. Note that the remote output folder address “remote://ahsoka_project_data/PS_141_test_output_data_folder” is expanded by my DVC configuration to the SSH remote “ssh://fogh@ahsoka.vfltest.dk:22/scratch/dvc_project_cache/PS/”; this URI also appears in the log file.
  • rsync_dataset.log: the verbose log from running dvc repro rsync_dataset.dvc -vf > rsync_dataset.log.

(EDIT - I totally forgot to attach the .py, .dvc and .log file: rsync_dataset.log)

rsync_dataset.sh
#!/bin/bash
#
# A script for copying the BDICG data in the
# "raster_data_cloudless_2018_12_06_10_15_29" folder containing the netCDF
# dataset per field to the PS data directory.
#
# Run the script locally. It connects to Ahsoka and initiates the copy.

echo '"rsync_data_set.sh" began...'

AHSOKA=ahsoka.vfltest.dk
BDICG_DATADIR=/scratch/dvc_users/fogh/PS/PS_141_test_dependency_data_folder/
PS_DIR=/scratch/dvc_users/fogh/PS/PS_141_test_output_data_folder/

ssh fogh@${AHSOKA} rsync -avh --stats --info=progress2 ${BDICG_DATADIR} ${PS_DIR}

echo '"rsync_data_set.sh" finished.'

rsync_dataset.dvc
cmd: bash rsync_dataset.sh
deps:
- md5: 6d1ed0801f6d6d065b99195771b1cb92
  path: rsync_dataset.sh
outs:
- cache: true
  md5: c0049518603c8f0154509c54b4630238.dir
  metric: false
  path: remote://ahsoka_project_data/PS_141_test_output_data_folder
  persist: false
wdir: .
md5: 1cfcba0bc0eed0e8e5b6fd1122cb5cc8

To answer @mroutis (as you asked in this message): my DVC cache is remote and configured as described here: https://github.com/PeterFogh/dvc_dask_use_case/blob/master/README.md. Regarding file size: on the remote server, the data folder “/scratch/dvc_users/fogh/PS/PS_141_test_dependency_data_folder/” contains 10 netCDF files:

$ ls -lh
total 71M
-rw-r--r-- 1 fogh hpcusers 2.8M May 15 13:03 0.nc
-rw-r--r-- 1 fogh hpcusers 6.5M May 15 13:03 1.nc
-rw-r--r-- 1 fogh hpcusers 4.7M May 15 13:03 2.nc
-rw-r--r-- 1 fogh hpcusers 1.4M May 15 13:03 3.nc
-rw-r--r-- 1 fogh hpcusers 3.0M May 15 13:03 4.nc
-rw-r--r-- 1 fogh hpcusers 3.1M May 15 13:03 5.nc
-rw-r--r-- 1 fogh hpcusers 3.5M May 15 13:03 6.nc
-rw-r--r-- 1 fogh hpcusers 2.8M May 15 13:03 7.nc
-rw-r--r-- 1 fogh hpcusers 2.8M May 15 13:03 8.nc
-rw-r--r-- 1 fogh hpcusers 3.0M May 15 13:03 9.nc

However, this is only a toy example. Our actual pipeline has a folder containing approx. 300 netCDF files, each of a similar size (approx. 5 MB).

My DVC is installed using pip and my DVC version is:

> conda deactivate && conda activate py37_v3 && dvc version
DVC version: 0.40.2
Python version: 3.7.3
Platform: Linux-4.4.0-43-Microsoft-x86_64-with-debian-stretch-sid

Example runtime: The runtime of dvc repro rsync_dataset.dvc -vf is:

$ time dvc repro rsync_dataset.dvc -v > rsync_dataset.log
real    2m30.600s
user    0m5.750s
sys     0m4.891s

As seen in rsync_dataset.log, the runtime of rsync_dataset.sh itself is less than 1 second. However, as I mentioned on Discord, the runtime of our actual pipeline is 39 minutes for a folder of 1.8 GB with 287 files.
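To put the two measurements side by side, here is a small arithmetic sketch that converts them into an average cost per file. It assumes the whole wall-clock time can be attributed to per-file work, which is a simplification; the helper name `seconds_per_file` is only illustrative.

```python
def seconds_per_file(total_seconds: float, n_files: int) -> float:
    """Average wall-clock seconds spent per file (assumes cost scales per file)."""
    return total_seconds / n_files


# Toy example: `real 2m30.600s` for a folder of 10 netCDF files.
toy = seconds_per_file(2 * 60 + 30.6, 10)

# Actual pipeline: 39 minutes for a folder of 287 files.
pipeline = seconds_per_file(39 * 60, 287)

print(f"toy example:     {toy:.1f} s/file")
print(f"actual pipeline: {pipeline:.1f} s/file")
```

Both cases come out at several seconds per file (roughly 15 s and 8 s), while hashing the files directly on the server takes a fraction of a second in total. That points at per-file overhead rather than data volume.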

I have also computed the MD5 checksums of all the files on the remote; this takes less than 1 second:

$ time find PS_141_test_dependency_data_folder/ -type f -exec md5sum {} \;
74ff8e3c0bd44f6487840df0965ed5c3  PS_141_test_dependency_data_folder/7.nc
608572e058c9753026392bbfead38a95  PS_141_test_dependency_data_folder/0.nc
1f412bd9a4be8b5886aa2ec24b53ef48  PS_141_test_dependency_data_folder/9.nc
dd336099ddbd0fab05301719008a210b  PS_141_test_dependency_data_folder/3.nc
5fa4a3f6665a918a4af5d8699d151956  PS_141_test_dependency_data_folder/8.nc
f9131d06bb09240e4dd2437735de506c  PS_141_test_dependency_data_folder/5.nc
2deaef6a7c55f2ce10d94f3593373001  PS_141_test_dependency_data_folder/6.nc
6be099278fde604c716eae23e7f3b70a  PS_141_test_dependency_data_folder/4.nc
51eac88b1630568765bd6e33b6e72ab7  PS_141_test_dependency_data_folder/2.nc

real    0m0.180s
user    0m0.163s
sys     0m0.017s

Suspicions for slow runtime: I suspect the slow runtime is because DVC performs MD5 checksum checks 3 times for each file in the remote folder:

  1. check if the file exists in the cache before executing the stage script - an example from the log is DEBUG: cache 'ssh://fogh@ahsoka.vfltest.dk:22/scratch/dvc_project_cache/PS/60/8572e058c9753026392bbfead38a95' expected '608572e058c9753026392bbfead38a95' actual '608572e058c9753026392bbfead38a95'
  2. check if the file exists in the cache after executing the stage script - an example from the log is DEBUG: cache 'ssh://fogh@ahsoka.vfltest.dk:22/scratch/dvc_project_cache/PS/60/8572e058c9753026392bbfead38a95' expected '608572e058c9753026392bbfead38a95' actual '608572e058c9753026392bbfead38a95' - note that the logged line is the same both before and after executing the stage script.
  3. check if the file has changed - an example from the log is DEBUG: checking if 'remote://ahsoka_project_data/PS_141_test_output_data_folder/0.nc'('{'md5': '608572e058c9753026392bbfead38a95'}') has changed.
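If the same file really is hashed three times per run, one generic way to avoid the repeated work is to memoize the checksum on the file's identity. The following is a minimal sketch of that idea, not DVC's actual implementation; the class `ChecksumCache`, its `(path, mtime, size)` key, and the stand-in hash function are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple


class ChecksumCache:
    """Memoize checksums per (path, mtime, size) so a file is hashed once per version."""

    def __init__(self, compute: Callable[[str], str]) -> None:
        self._compute = compute  # the expensive call, e.g. hashing over SSH
        self._entries: Dict[Tuple[str, float, int], str] = {}
        self.misses = 0  # how many times the expensive call was actually made

    def get(self, path: str, mtime: float, size: int) -> str:
        key = (path, mtime, size)
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = self._compute(path)
        return self._entries[key]


def fake_remote_md5(path: str) -> str:
    # Stand-in for an expensive remote checksum; returns a fixed digest.
    return "608572e058c9753026392bbfead38a95"


cache = ChecksumCache(fake_remote_md5)
# The three checks listed above (before the stage, after the stage, "has changed").
# The mtime and size values here are made up for the example.
for _ in range(3):
    cache.get("PS_141_test_output_data_folder/0.nc", 1557925380.0, 2936012)
print(cache.misses)
```

With this key, a change in modification time or size forces a recomputation, while repeated checks of an untouched file reuse the stored digest.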

I also suspect the slow runtime is due to the many SSH connections created for each file checksum.
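To illustrate the alternative to one connection per file, the sketch below hashes a whole directory with a single remote command and parses the result locally. It assumes GNU `find` and `md5sum` are available on the server and that file names contain no newlines or backslashes; the function names are hypothetical and this is not how DVC itself is implemented.

```python
import shlex
import subprocess
from typing import Dict


def build_find_md5_command(directory: str) -> str:
    """Build one remote shell command that hashes every file under `directory`."""
    # `-exec ... {} +` passes many files to each md5sum invocation.
    return f"find {shlex.quote(directory)} -type f -exec md5sum {{}} +"


def parse_md5sum_output(text: str) -> Dict[str, str]:
    """Map each path to its digest, given lines of '<32 hex chars><2 separator chars><path>'."""
    checksums: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        checksums[line[34:]] = line[:32]
    return checksums


def remote_md5s(host: str, directory: str) -> Dict[str, str]:
    """Fetch all checksums over a single SSH connection (needs SSH access to run)."""
    result = subprocess.run(
        ["ssh", host, build_find_md5_command(directory)],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_md5sum_output(result.stdout)
```

A call such as `remote_md5s("fogh@ahsoka.vfltest.dk", "/scratch/dvc_users/fogh/PS/PS_141_test_dependency_data_folder/")` would then open one connection for the whole folder instead of one per file.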

Question: Is it possible to optimize the checksum checks for remote folders to improve the runtime?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

2 reactions
pared commented, May 16, 2019

From private discussion: It is worth noting that the state database has a uniqueness constraint on inode, so there is a possible case where we have a state database with mixed local/ssh entries and we start overriding one with another. It would be desirable to handle this potential problem in this task.
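The collision described above can be reproduced in isolation. The sketch below uses an in-memory SQLite table with a made-up schema (the real state database layout is not shown in this thread) to compare a uniqueness constraint on `inode` alone with one on `(scheme, inode)`.

```python
import sqlite3
from typing import List, Tuple


def make_db(unique_columns: str) -> sqlite3.Connection:
    """Create an in-memory table with a UNIQUE constraint over `unique_columns`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE state ("
        " scheme TEXT NOT NULL,"
        " inode INTEGER NOT NULL,"
        " md5 TEXT NOT NULL,"
        f" UNIQUE ({unique_columns}))"
    )
    return conn


def save(conn: sqlite3.Connection, scheme: str, inode: int, md5: str) -> None:
    # INSERT OR REPLACE deletes any existing row that conflicts on the unique key.
    conn.execute(
        "INSERT OR REPLACE INTO state (scheme, inode, md5) VALUES (?, ?, ?)",
        (scheme, inode, md5),
    )


def rows(conn: sqlite3.Connection) -> List[Tuple[str, int, str]]:
    return conn.execute(
        "SELECT scheme, inode, md5 FROM state ORDER BY scheme"
    ).fetchall()


inode_only = make_db("inode")          # constraint on inode alone
namespaced = make_db("scheme, inode")  # constraint also covers the filesystem

for db in (inode_only, namespaced):
    save(db, "local", 42, "aaa")  # a local file with inode 42
    save(db, "ssh", 42, "bbb")    # an unrelated remote file that also has inode 42

print(rows(inode_only))   # the local entry has been overwritten
print(rows(namespaced))   # both entries survive
```

Inode numbers are only unique within one filesystem, so a key that also identifies the filesystem (here the hypothetical `scheme` column) keeps local and SSH entries apart.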

1 reaction
PeterFogh commented, Sep 27, 2019

@efiop After trying the newer versions of DVC, I’m happy with the speed. It can cache 5 GB in seconds. Now, I just need the cache to work in my pipeline, see https://github.com/iterative/dvc/issues/2542. Thus, I propose we close this issue 😃
