
GCP download very slow for slightly large files


Problem description

I am trying to download a moderately large file (1.1 GB). The attached code using smart_open takes a long time (15m40s), while gsutil cp takes about 25s. Google's own storage.blob API is also quite fast (comparable to gsutil).

Steps/code to reproduce the problem

Code used:

import time
import sys
from smart_open import open as cloud_open

gcs_uri = "<redacted file name>"
dl_path = "./test.pkl"

def current_secs():
    return int(round(time.time()))

chunk_size = 256 * 1024 * 1024  # 256 MiB
count = 0
# Same slowness even with transport_params={'min_part_size': chunk_size}
with cloud_open(gcs_uri, mode="rb") as cloud_fd:
    with open(dl_path, mode="wb+") as local_fd:
        print("Start time: ", current_secs())
        sys.stdout.flush()
        while True:
            start = current_secs()
            data = cloud_fd.read(chunk_size)
            print("Read chunk [{}] of at most size [{}] from [{}] in [{}] secs".format(
                count, chunk_size, gcs_uri, current_secs() - start))
            sys.stdout.flush()
            if not data:
                break
            start = current_secs()
            local_fd.write(data)
            print("Wrote chunk [{}] of at most size [{}] to [{}] in [{}] secs".format(
                count, chunk_size, dl_path, current_secs() - start))
            sys.stdout.flush()
            count += 1  # increment once per chunk so read/write log lines match

Nearly every chunk read above takes close to 230s, while each write to the output file on the local file system completes with sub-second latency.
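For comparison, the fast path mentioned above (Google's storage.blob API) looks roughly like this. A minimal sketch, assuming application-default credentials; the bucket and blob names are hypothetical stand-ins for the redacted URI:

import time

from google.cloud import storage  # the google-cloud-storage package

BUCKET_NAME = "my-bucket"        # hypothetical; real URI redacted above
BLOB_NAME = "path/to/test.pkl"   # hypothetical
DL_PATH = "./test.pkl"

client = storage.Client()  # uses application-default credentials
blob = client.bucket(BUCKET_NAME).blob(BLOB_NAME)

start = time.time()
blob.download_to_filename(DL_PATH)  # streams the object straight to disk
print("Downloaded in {:.1f} secs".format(time.time() - start))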

Versions

Output of:

Python 3.7.7 (default, Apr 18 2020, 02:59:53)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform, sys, smart_open
>>> print(platform.platform())
Linux-5.4.0-1011-gcp-x86_64-with-Ubuntu-20.04-focal
>>> print("Python", sys.version)
Python 3.7.7 (default, Apr 18 2020, 02:59:53)
[GCC 9.3.0]
>>> print("smart_open", smart_open.__version__)
smart_open 3.0.0

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 13 (7 by maintainers)

Top GitHub Comments

2 reactions
petedannemann commented, Nov 6, 2020
pytest integration-tests/test_gcs.py::test_gcs_performance

--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s)            Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_gcs_performance     2.1291  2.2363  2.1769  0.0431  2.1742  0.0688       2;0  0.4594       5           1
-------------------------------------------------------------------------------------------------------------

Yep, this is much slower than it should be. I remember running initial benchmarks during development and seeing numbers much lower than this. I’m not sure if something has changed in the code or if my memory is failing me / I ran improper benchmarks, but these numbers are definitely unacceptable. I can try to do some profiling soon to figure out where the bottlenecks are.
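As a starting point for that profiling, here is a generic stdlib sketch (not the project's benchmark harness) that profiles a single chunked read; the URI is hypothetical since the original is redacted:

import cProfile
import pstats

from smart_open import open as cloud_open

GCS_URI = "gs://my-bucket/path/to/test.pkl"  # hypothetical URI

def read_once(chunk_size=256 * 1024 * 1024):
    # Drain the object in chunks, discarding the data; only read cost matters.
    with cloud_open(GCS_URI, mode="rb") as fd:
        while fd.read(chunk_size):
            pass

profiler = cProfile.Profile()
profiler.enable()
read_once()
profiler.disable()

# Show the 20 most time-consuming calls to see where the read path spends time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)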

2 reactions
arunmk commented, Nov 6, 2020

I have tried various options, including passing transport parameters and reading the entire 1.1 GB at once without chunking. They are all in a similar ballpark and very slow compared to gsutil. I also initially tried v2.1.0, which took similar times.
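For reference, a sketch of how such a transport option is passed. The buffer_size parameter here is an assumption about the gs:// transport in smart_open 3.x (min_part_size, tried in the repro code above, should only affect uploads); the URI is again hypothetical:

from smart_open import open as cloud_open

GCS_URI = "gs://my-bucket/path/to/test.pkl"  # hypothetical URI

# Assumption: smart_open 3.x's GCS transport accepts `buffer_size`, which
# sets how many bytes each ranged GET fetches when reading.
params = {"buffer_size": 64 * 1024 * 1024}  # 64 MiB per range request

with cloud_open(GCS_URI, mode="rb", transport_params=params) as fd:
    data = fd.read()  # read the whole 1.1 GB object in one call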

