how to `gsutil cp` gzip files without decompressing them?
Reading the GCS decompressive transcoding documentation, I understand that the only way to retrieve a gzipped file stored on GCS with `Content-Encoding: gzip` in its compressed state is to pass an `Accept-Encoding: gzip` header when requesting it.
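For reference, here is roughly what that looks like against the JSON API directly (a sketch, not a tested recipe; `xxx` and `0.json.gz` are the placeholder bucket/object names used below, and it assumes `gcloud` credentials are available):

```
# Request the object media with Accept-Encoding: gzip so the server
# skips decompressive transcoding and returns the stored gzip bytes.
curl -H "Accept-Encoding: gzip" \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -o 0.json.gz \
     "https://storage.googleapis.com/storage/v1/b/xxx/o/0.json.gz?alt=media"
```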
When trying to do the same with `gsutil`, I get an error:
```
$ gsutil ls -L gs://xxx/0.json.gz | grep 'Content-'
Content-Encoding: gzip
Content-Length: 129793
Content-Type: text/plain
$ gsutil -h "Accept-Encoding: gzip" cp gs://xxx/0.json.gz .
ArgumentException: Invalid header specified: accept-encoding: gzip
```
(I know this example is probably bad; the extension shouldn’t explicitly be set to `.gz`, but I have to work with this right now.)
My guess is that `gsutil` performs client-side decompression (as suggested by the `gsutil cp` documentation), which prevents passing the `Accept-Encoding` header.
My question, then: how can I use `gsutil` to download a gzip file whose metadata is set to `Content-Encoding: gzip` without decompressing it (and without having to set other metadata like `Cache-Control: no-transform`, if that would be a workaround)?
It doesn’t look like there’s a way to disable the auto-decompression behavior for `gsutil cp`. For one-off use cases, `gsutil cat` will skip the decompression (see the sketch below). But I realize it’s very slow and painful to run a separate invocation of gsutil for every object like this. We should provide some sort of behavior to prevent auto-decompression when downloading objects via cp/mv/rsync.
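For example, something like this should keep the object in its compressed state (a sketch using the placeholder names from the question):

```
# `cat` streams the stored (still-compressed) bytes to stdout,
# so redirecting them to a file preserves the gzip encoding.
gsutil cat gs://xxx/0.json.gz > 0.json.gz
```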
Hey 😃
Any update on this one?
It is quite inconvenient when working with lots of gzip files in a data science environment (several TBs in gzip format). I have not found a good workaround so far (I will try the `cat` version, but the lack of md5sum checks might lead to other problems 😕).
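For what it’s worth, one way to get an integrity check back after `gsutil cat` might be to compare the object’s stored MD5 against a locally computed one (a sketch; the stored hash covers the compressed bytes, so it should match the `cat` output):

```
# MD5 recorded in the object's metadata (base64):
gsutil ls -L gs://xxx/0.json.gz | grep 'Hash (md5)'
# MD5 of the locally downloaded, still-compressed file:
gsutil hash -m 0.json.gz
```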
Do you maybe know of a script using the Python API (or any other language) that does that?
Thanks.
PS (edit): After looking at some code, I found https://github.com/hackersandslackers/googlecloud-storage-tutorial/blob/master/main.py. I used this approach with `raw_download=True` on `download_to_filename` and it works, but seemingly slower than gsutil (which is a problem given the large amount of data I need to transfer).
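For reference, a minimal sketch of that approach with the `google-cloud-storage` Python client (placeholder names from the question; not the exact script from the link above):

```python
from google.cloud import storage

client = storage.Client()
blob = client.bucket("xxx").blob("0.json.gz")

# raw_download=True asks for the stored bytes as-is, skipping the
# client-side decompressive transcoding that would otherwise gunzip them.
blob.download_to_filename("0.json.gz", raw_download=True)
```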