How to `gsutil cp` gzip files without decompressing them?

From reading the GCS decompressive transcoding documentation, I understand that the only way to retrieve a gzipped file stored on GCS with Content-Encoding: gzip in its compressed state is to pass the "Accept-Encoding: gzip" header when requesting it.

When I try to do so using gsutil, I get an error:

$ gsutil ls -L gs://xxx/0.json.gz | grep 'Content-'
    Content-Encoding:       gzip
    Content-Length:         129793
    Content-Type:           text/plain
$ gsutil -h "Accept-Encoding: gzip" cp gs://xxx/0.json.gz .
ArgumentException: Invalid header specified: accept-encoding: gzip

(I know this example is probably bad; the extension shouldn't be explicitly set to .gz, but I have to work with this right now.)

My guess is that gsutil performs client-side decompression (as suggested by the gsutil cp documentation) and therefore does not allow the Accept-Encoding header to be passed.

My question, then, is: how can I use gsutil to download a gzip file whose metadata is set to Content-Encoding: gzip without decompressing it (and without having to set other metadata such as Cache-Control: no-transform, if that would be a workaround)?

Issue Analytics

  • State: open
  • Created 5 years ago
  • Reactions: 1
  • Comments: 7

Top GitHub Comments

11 reactions
houglum commented, Apr 12, 2018

It doesn’t look like there’s a way to disable the auto-decompression behavior for gsutil cp. For one-off use cases, gsutil cat will skip the decompression:

$ gsutil cat gs://bucket/obj.gz > /destination/path/obj.gz

But I realize it’s very slow and painful to run a separate invocation of gsutil for every object like this. We should provide some sort of behavior to prevent auto-decompression when downloading objects via cp/mv/rsync.
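For more than a handful of objects, the `gsutil cat` workaround above can be scripted. The following is only a minimal sketch, assuming `gsutil` is on the `PATH` and the object URLs are already known. The helper names `local_name` and `cat_object` are hypothetical, and each object still costs one `gsutil` process, so this does not remove the slowness described above.

```python
import subprocess
from pathlib import Path


def local_name(url: str) -> str:
    """Return the last path component of a gs:// URL (hypothetical helper)."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def cat_object(url: str, dest_dir: str, run=subprocess.run) -> Path:
    """Write the stored bytes of one object into dest_dir using `gsutil cat`.

    `run` is injectable so the function can be exercised without gsutil.
    """
    dest = Path(dest_dir) / local_name(url)
    with open(dest, "wb") as out:
        # Equivalent to: gsutil cat <url> > <dest>
        run(["gsutil", "cat", url], stdout=out, check=True)
    return dest


# Usage sketch (the URL is the placeholder from the example above):
# for url in ["gs://bucket/obj.gz"]:
#     cat_object(url, "/destination/path")
```

Because the output is written as raw bytes, the local file should remain a valid gzip file, but no integrity check is performed here.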

4 reactions
benoitbayol commented, Jun 24, 2020

Hey 😃

Any update on this one?

This is quite inconvenient when working with lots of gzip files in a data science environment (several TBs in gzip format). I have not found a good workaround so far. I will try the cat version, but the lack of md5sum checks might lead to other problems 😕

Maybe you know of a script using the Python API (or any other language) that does this?

Thanks.

PS (edit): After looking through some code, I found https://github.com/hackersandslackers/googlecloud-storage-tutorial/blob/master/main.py . I used it with raw_download=True on download_to_filename and it works, but it appears to be slower than gsutil, which is a problem given the large amount of data I need to transfer.
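The `raw_download=True` approach mentioned above can be sketched with the `google-cloud-storage` Python client. This is an illustrative sketch, not the code from the linked tutorial: the function name `download_raw` and the bucket and object names in the usage comment are placeholders. The MD5 comparison assumes that the object's `md5_hash` metadata is the base64-encoded MD5 of the stored (compressed) bytes.

```python
import base64
import hashlib


def download_raw(bucket, blob_name: str, dest: str) -> str:
    """Download an object's stored bytes without decompressive transcoding.

    `bucket` is expected to be a google.cloud.storage Bucket.
    """
    # get_blob fetches the object's metadata (including md5_hash).
    blob = bucket.get_blob(blob_name)
    if blob is None:
        raise FileNotFoundError(blob_name)
    blob.download_to_filename(dest, raw_download=True)

    # Hash the local file in chunks and compare with the object's metadata.
    md5 = hashlib.md5()
    with open(dest, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    local_md5 = base64.b64encode(md5.digest()).decode("ascii")
    if blob.md5_hash and blob.md5_hash != local_md5:
        raise ValueError(f"MD5 mismatch for {blob_name}")
    return dest


# Usage sketch (requires the google-cloud-storage package and credentials):
# from google.cloud import storage
# bucket = storage.Client().bucket("xxx")
# download_raw(bucket, "0.json.gz", "0.json.gz")
```

The explicit hash comparison is one possible way to address the md5sum concern raised above. It is skipped when the object has no `md5_hash` metadata.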
