how to `gsutil cp` gzip files without decompressing them?
Reading the GCS decompressive transcoding documentation, I understand that the only way to retrieve a gzipped file stored on GCS with `Content-Encoding: gzip` in its compressed state is to pass an `Accept-Encoding: gzip` header when requesting it.
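For reference, here is roughly what that looks like against the JSON API directly (a sketch, not a tested recipe; `xxx` and `0.json.gz` are the placeholder bucket/object names used below, and it assumes `gcloud` credentials are available):

```
# Request the object media with Accept-Encoding: gzip so the server
# skips decompressive transcoding and returns the stored gzip bytes.
curl -H "Accept-Encoding: gzip" \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -o 0.json.gz \
     "https://storage.googleapis.com/storage/v1/b/xxx/o/0.json.gz?alt=media"
```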
When trying to do the same with `gsutil`, I get an error:
```
$ gsutil ls -L gs://xxx/0.json.gz | grep 'Content-'
Content-Encoding: gzip
Content-Length: 129793
Content-Type: text/plain
$ gsutil -h "Accept-Encoding: gzip" cp gs://xxx/0.json.gz .
ArgumentException: Invalid header specified: accept-encoding: gzip
```
(I know this example is probably bad; the extension shouldn’t explicitly be set to `.gz`, but I have to work with this right now.)
My guess is that `gsutil` performs client-side decompression (as suggested by the `gsutil cp` documentation), which prevents passing the `Accept-Encoding` header.
My question, then: how can I use `gsutil` to download a gzip file whose metadata is set to `Content-Encoding: gzip` without decompressing it (and without having to set other metadata like `Cache-Control: no-transform`, if that would be a workaround)?
It doesn’t look like there’s a way to disable the auto-decompression behavior for `gsutil cp`. For one-off use cases, `gsutil cat` will skip the decompression (see the sketch below). But I realize it’s very slow and painful to run a separate invocation of gsutil for every object like this. We should provide some sort of behavior to prevent auto-decompression when downloading objects via cp/mv/rsync.
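For example, something like this should keep the object in its compressed state (a sketch using the placeholder names from the question):

```
# `cat` streams the stored (still-compressed) bytes to stdout,
# so redirecting them to a file preserves the gzip encoding.
gsutil cat gs://xxx/0.json.gz > 0.json.gz
```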
Hey 😃
Any update on this one?
It is quite inconvenient when working with lots of gzip files in a data science environment (several TBs in gzip format). I have not found a good workaround so far (I will try the `cat` version, but the lack of md5sum checks might lead to other problems 😕).
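For what it’s worth, one way to get an integrity check back after `gsutil cat` might be to compare the object’s stored MD5 against a locally computed one (a sketch; the stored hash covers the compressed bytes, so it should match the `cat` output):

```
# MD5 recorded in the object's metadata (base64):
gsutil ls -L gs://xxx/0.json.gz | grep 'Hash (md5)'
# MD5 of the locally downloaded, still-compressed file:
gsutil hash -m 0.json.gz
```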
Do you maybe know of a script using the Python API (or any other language) that does that?
Thanks.
PS (edit): After looking at some code, I found https://github.com/hackersandslackers/googlecloud-storage-tutorial/blob/master/main.py. I used this approach with `raw_download=True` on `download_to_filename` and it works, but seemingly slower than gsutil (which is a problem given the large amount of data I need to transfer).
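For reference, a minimal sketch of that approach with the `google-cloud-storage` Python client (placeholder names from the question; not the exact script from the link above):

```python
from google.cloud import storage

client = storage.Client()
blob = client.bucket("xxx").blob("0.json.gz")

# raw_download=True asks for the stored bytes as-is, skipping the
# client-side decompressive transcoding that would otherwise gunzip them.
blob.download_to_filename("0.json.gz", raw_download=True)
```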