directory download and uploads are slow

Bug Report

  • This affects multiple commands (get/fetch/pull/push…), so I didn’t add a command tag.

Description

We have added a directory containing 70,000 small images to the Dataset Registry. There is also a tar.gz version of the dataset, which downloads quickly:

time dvc get https://github.com/iterative/dataset-registry mnist/images.tar.gz
dvc get https://github.com/iterative/dataset-registry mnist/images.tar.gz  3.41s user 1.36s system 45% cpu 10.411 total

When I issue:

dvc get https://github.com/iterative/dataset-registry mnist/images

I get an ETA of ~16 hours for the 70,000 downloads on my VPS.

This is reduced to ~3 hours on my faster local machine.

I didn’t wait for these to finish, so the real times may differ, but you get the idea.

With -j 10 the ETA doesn’t differ much.

dvc pull is better: it takes about 20-25 minutes.

(At this point, while I was writing this, a new version was released, so the rest of the report uses 2.4.1 😄)

dvc pull -j 100 seems to reduce the ETA to 10 minutes.

(I waited for dvc pull -j 100 to finish and it took ~15 minutes.)

I also had this issue while uploading the data in iterative/dataset-registry#18 and we have a discussion there.

Reproduce

git clone https://github.com/iterative/dataset-registry
cd dataset-registry
dvc pull mnist/images.dvc

or

dvc get https://github.com/iterative/dataset-registry mnist/images

Expected

We will use this dataset (and a similar fashion-mnist one) in example repositories, so we would like the whole directory to download in an acceptable time (<2 minutes).
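To put the reported times next to the 2-minute target, the per-file cost can be worked out from the figures above. This is only a rough back-of-the-envelope sketch: the ETAs are approximate, and the helper names are made up for illustration.

```python
N_FILES = 70_000  # number of images in mnist/images, from the report


def seconds_per_file(total_seconds: float, n_files: int = N_FILES) -> float:
    """Average wall-clock time spent per file."""
    return total_seconds / n_files


def files_per_second(total_seconds: float, n_files: int = N_FILES) -> float:
    """Average throughput needed (or achieved) for the whole directory."""
    return n_files / total_seconds


# Durations taken from the report, converted to seconds.
observations = {
    "dvc get on VPS (~16 h ETA)": 16 * 3600,
    "dvc get on local machine (~3 h ETA)": 3 * 3600,
    "dvc pull -j 100 (~15 min measured)": 15 * 60,
    "target (2 min)": 2 * 60,
}

for label, total in observations.items():
    print(
        f"{label}: {seconds_per_file(total) * 1000:.1f} ms/file, "
        f"{files_per_second(total):.1f} files/s"
    )
```

The VPS run works out to roughly 0.8 s per file, while the 2-minute target needs about 583 files per second, so per-file overhead rather than raw bandwidth is what has to shrink.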

Environment information

Output of dvc doctor:

Some of this report was produced with 2.3.0, but the current version is:

$ dvc doctor
DVC version: 2.4.1 (pip)
---------------------------------
Platform: Python 3.8.5 on Linux-5.4.0-74-generic-x86_64-with-glibc2.29
Supports: azure, gdrive, gs, hdfs, webhdfs, http, https, s3, ssh, oss

Discussion

DVC creates new requests.Session objects when connecting, which requires a new HTTP(S) connection for each file. Although the files are small, establishing a new connection for each one takes time.

HTTP/1.1 has a mechanism for sending several requests over the same connection (pipelining), but requests doesn’t support it.
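The cost of one connection per file can be illustrated with a small, self-contained sketch. It does not use DVC or requests: it assumes a throwaway local HTTP/1.1 server from the standard library and uses http.client to compare opening a connection per file with reusing one persistent (keep-alive) connection. All class and function names here are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the server keep connections open
    connections = 0                # TCP connections accepted so far
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted connection, not once per request.
        with CountingHandler.lock:
            CountingHandler.connections += 1
        super().setup()

    def do_GET(self):
        body = b"x" * 64  # stand-in for a small image
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_with_new_connections(host, port, paths):
    """Open and close a connection for every file."""
    for path in paths:
        conn = http.client.HTTPConnection(host, port)
        conn.request("GET", path)
        conn.getresponse().read()
        conn.close()


def fetch_with_one_connection(host, port, paths):
    """Reuse a single persistent connection for all files."""
    conn = http.client.HTTPConnection(host, port)
    for path in paths:
        conn.request("GET", path)
        conn.getresponse().read()  # drain before sending the next request
    conn.close()


def count_connections(fetch, n_files=20):
    """Run `fetch` against a fresh local server and count TCP connections."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        host, port = server.server_address
        fetch(host, port, [f"/img/{i}.png" for i in range(n_files)])
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections
```

Calling count_connections(fetch_with_new_connections) reports 20 connections for 20 files, while count_connections(fetch_with_one_connection) reports 1. Over HTTPS each extra connection also pays for a TLS handshake, which this plain-HTTP sketch does not measure.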

Note that increasing the number of jobs doesn’t make much difference, because servers usually limit the number of connections per IP. Even if you have 100 threads/processes downloading, probably only a small number (~4-8) of them can be connected at a time. (I was once banned from AWS while testing the commands with a large -j.)

There may be 2 solutions for this:

  • DVC can treat directories as implicit tar archives. Instead of a directory containing many files, it would work with a single tar file per directory in the cache and expand it on checkout. tar and gzip are supported in the Python standard library. This probably requires updating the whole Repo class, though.

  • Instead of requests, DVC can use a custom solution or another library like dugong that supports HTTP pipelining. I haven’t tested any HTTP pipelining solution in Python, so I can’t vouch for any of them, but this may be better for all asynchronous operations using HTTP(S).
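The first option can be sketched with the standard library alone. This is a minimal illustration of packing a directory into one tar.gz and expanding it again, not DVC's cache or checkout code; the function names and the temporary layout are assumptions made for the example.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_directory(src_dir: Path, archive: Path) -> None:
    """Store every file under src_dir in a single gzip-compressed tar."""
    with tarfile.open(archive, "w:gz") as tar:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so the archive is relocatable.
                tar.add(path, arcname=path.relative_to(src_dir).as_posix())


def unpack_directory(archive: Path, dest_dir: Path) -> None:
    """Expand the archive into dest_dir (the 'checkout' step)."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Only extract archives you created or otherwise trust.
        tar.extractall(dest_dir)


def roundtrip_demo(n_files: int = 100) -> bool:
    """Pack n_files small files, unpack them, and compare the contents."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        src = root / "images"
        src.mkdir()
        for i in range(n_files):
            (src / f"{i:05d}.png").write_bytes(bytes([i % 256]) * 32)

        archive = root / "images.tar.gz"
        pack_directory(src, archive)

        restored_dir = root / "restored"
        unpack_directory(archive, restored_dir)

        original = {p.name: p.read_bytes() for p in src.iterdir()}
        restored = {p.name: p.read_bytes() for p in restored_dir.iterdir()}
        return original == restored
```

With this layout only one object per directory has to be transferred, at the cost of re-uploading the whole archive when a single file changes.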

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 24 (10 by maintainers)

Top GitHub Comments

3 reactions
dberenbaum commented, Sep 14, 2022

Removing the p1 label here because this is clearly not going to be closed in the next sprint or so, which should be the goal for a p1. However, it’s still a high priority that we will continue to be working to improve.

3 reactions
dberenbaum commented, Jun 29, 2021

Also, one more experiment: what happens with -j 100 and --numworkers=100? Just to see how it scales. Also, how much of your network bandwidth do they take?

TLDR: It looks like the multiple processes launched by s5cmd do improve performance, whereas the additional threads in dvc don’t seem to be helping much.

s5cmd --numworkers 100 finishes in about 5 minutes for me on multiple tries. Here’s a random snapshot of cpu and network activity while files were downloading:

[Images: s5cmd CPU and network activity snapshots]

dvc pull -j 100 varied more with multiple attempts but took up to 20 minutes. Here are cpu and network activity snapshots:

[Images: dvc pull CPU and network activity snapshots]