
GCS decompressive transcoding not supported when reading

See original GitHub issue

Problem description

Using smart_open (unreleased version with GCS support) to download files from GCS with transparent decompressive transcoding enabled may lead to incomplete files being downloaded depending on the compressed file size.

With Google Cloud Storage there is the option to store gzip-compressed files and use decompressive transcoding to transparently decompress them when downloading. Decompression is then handled by Google's servers. In this case, the filename wouldn't have any compression extension (e.g. file.csv), but when inspecting its metadata, it would contain something like this:

{
    "Content-Type": "text/csv; charset=utf-8",
    "Content-Encoding": "gzip"
}

This would be fine if it weren’t for the fact that in such cases, Blob()._size will return the compressed size. Since smart_open uses this to understand when to stop reading, it results in incomplete files.
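
For illustration, the same metadata can be inspected directly with the google-cloud-storage client. This is just a sketch using the bucket and object names from the reproduction steps below, and it assumes default gcloud credentials:

from google.cloud import storage

client = storage.Client()
blob = client.bucket('my-bucket').get_blob('rand.txt')

print(blob.content_encoding)  # 'gzip'       -> stored compressed, transcoded on download
print(blob.content_type)      # 'text/plain'
print(blob.size)              # compressed size (300842 here), not the decompressed size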

Steps/code to reproduce the problem

write a ~400 KB file (larger than smart_open's default buffer size)

$ cat /dev/urandom | gtr -dc A-Za-z0-9 | head -c 400000 > rand.txt

$ ls -l  
total 1024
-rw-r--r--  1 gustavomachado  staff  524288 Feb 17 12:34 rand.txt

upload file to GCS

$ gsutil cp -Z ./rand.txt gs://my-bucket/
Copying file://./rand.txt [Content-Type=text/plain]...
- [1 files][293.8 KiB/293.8 KiB]                                                
Operation completed over 1 objects/293.8 KiB. 

The resulting (compressed) file is 293.8 KiB.

check file metadata

$ gsutil stat gs://my-bucket/rand.txt                       
gs://my-bucket/rand.txt:
    Creation time:          Mon, 17 Feb 2020 13:45:36 GMT
    Update time:            Mon, 17 Feb 2020 13:45:36 GMT
    Storage class:          MULTI_REGIONAL
    Cache-Control:          no-transform
    Content-Encoding:       gzip
    Content-Language:       en
    Content-Length:         300842
    Content-Type:           text/plain
    Hash (crc32c):          Ko+ooA==
    Hash (md5):             8C6OlwZIR+fgRMy2xmQqLw==
    ETag:                   CNWW+Kjc2OcCEAE=
    Generation:             1581947136379733
    Metageneration:         1

download file using smart_open (gcloud credentials already set)

>>> from smart_open import open
>>> with open('gs://my-bucket/rand.txt', 'r') as fin:
...     with open('downloaded.txt', 'w') as fout:
...         for line in fin:
...             fout.write(line)
... 
348550

check resulting file size

$ ls -l
total 1472
-rw-r--r--  1 gustavomachado  staff  348550 Feb 17 14:48 downloaded.txt
-rw-r--r--  1 gustavomachado  staff  400000 Feb 17 14:45 rand.txt

The original file is 400 KB, but the downloaded file is only 348 KB. I'm not sure why it's still bigger than the 300842 bytes reported by Google, though.

Versions

Please provide the output of:

>>> import platform, sys, smart_open
>>> print(platform.platform())
Darwin-18.7.0-x86_64-i386-64bit
>>> print("Python", sys.version)
Python 3.7.2 (default, Dec  9 2019, 14:10:57) 
[Clang 10.0.1 (clang-1001.0.46.4)]
>>> print("smart_open", smart_open.__version__)
smart_open 1.9.0

smart_open has been pinned to 72818ca, installed with

$ pip install git+git://github.com/RaRe-Technologies/smart_open.git@72818ca6d3a0a99e1717ab31db72bf109ac5ce65 

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software

Possible solutions

Setting buffer_size to a value larger than the compressed file size will of course download it in its entirety, but for large files that would mean loading the entire file into memory.
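
For the example file above, that workaround would look something like this (a sketch; the buffer_size value is just a rough over-estimate of the compressed size):

from smart_open import open

# Workaround sketch: make the GCS reader's buffer larger than the compressed
# object, so the whole thing arrives in a single read. Only practical for small
# files, since the entire object is held in memory.
with open('gs://my-bucket/rand.txt', 'r',
          transport_params=dict(buffer_size=400000)) as fin:
    data = fin.read()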

A reasonable option would be to check Blob().content_encoding, and if it is equal to 'gzip', call Blob().download_as_string with raw_download=True, and then handle decompression internally with the already-existing decompression mechanisms.

If the maintainers agree this would be a viable solution, I’ll be happy to provide a PR implementing it.
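
In the meantime, the transcoding can be worked around entirely outside smart_open by downloading the raw stored bytes with the GCS client and decompressing locally. A minimal sketch, assuming the example bucket and object from above:

import gzip
from google.cloud import storage

client = storage.Client()
blob = client.bucket('my-bucket').get_blob('rand.txt')

# raw_download=True fetches the stored (gzip-compressed) bytes as-is, so the
# length agrees with Blob.size and decompressive transcoding never kicks in.
payload = blob.download_as_string(raw_download=True)
data = gzip.decompress(payload) if blob.content_encoding == 'gzip' else payload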


Top GitHub Comments

2 reactions
hoverinc-frankmata commented, Sep 3, 2020

I can confirm this problem, though I noticed it in a different manner. I tested with a file similar to the one above, but used transport_params=dict(buffer_size=1024) to force it to stream in parts. I have 4 cases to share with you, testing GCS files named file.txt and file.txt.gz, each with ignore_ext=True and ignore_ext=False. Data was uploaded to GCS with gsutil cp -Z (see the script at the end of this comment).

  • file.txt with ignore_ext=True, file.txt with ignore_ext=False, file.txt.gz with ignore_ext=True: reads the first buffered chunk OK; subsequent chunks throw “urllib3.exceptions.DecodeError: (‘Received response with content-encoding: gzip, but failed to decode it.’, error(‘Error -3 while decompressing data: incorrect header check’))”

  • file.txt.gz with ignore_ext=False: raises “OSError: Not a gzipped file (b’FI’)” immediately

I modified the smart_open code to add raw_download=True to the download_as_string() call, and here are the results.

  • file.txt with ignore_ext=True, file.txt with ignore_ext=False, file.txt.gz with ignore_ext=True: the file streams from the open() object as gzipped binary. The size of the data matches the compressed size on GCS. I was able to decompress this data successfully using gzip.decompress(data), and the data matched exactly.

  • file.txt.gz with ignore_ext=False: the file streams from the open() object as uncompressed data. The file size matches the raw size before upload. The data matches the raw uncompressed data exactly.

Hope this helps y’all to find a good solution. Included below is the quick and dirty script I used to demo this.

set -e
GCP_BUCKET=???
echo "START" > rand.txt
cat /dev/urandom | gtr -dc A-Za-z0-9 | head -c 40000 >> rand.txt
echo "END" >> rand.txt
ls -al rand.txt
gsutil cp -Z rand.txt ${GCP_BUCKET}/rand.txt
gsutil cp -Z rand.txt ${GCP_BUCKET}/rand.txt.gz
gsutil ls -l ${GCP_BUCKET}/rand.txt ${GCP_BUCKET}/rand.txt.gz

for f in ${GCP_BUCKET}/rand.txt ${GCP_BUCKET}/rand.txt.gz ; do
  for ignore_ext in True False; do
    python -c "
from smart_open import open
import gzip
print('\n\n--------${f} ignore_ext=${ignore_ext}----------')
with open('${f}', 'rb', ignore_ext=${ignore_ext}, transport_params=dict(buffer_size=1024)) as f:
  data = f.read()
  print(len(data))
  print(data[:5])
  print(data[-4:])
  raw = gzip.decompress(data)
  print(len(raw))
  print(raw[:5])
  print(raw[-4:])
print('Done')
    " || echo "Function failed"
  done
done

0 reactions
petedannemann commented, Mar 10, 2020

OK, @gdmachado’s suggestion is probably what we want to do then. If Blob.content_encoding == 'gzip' and file_extension != '.gz', then we save state so smart_open knows the file was transcoded to gzip and download the raw compressed data, which will have a size true to Blob.size. I think smart_open.gcs.SeekableBufferedInputBase.read will have to use this state to know to decompress the data before it is returned to the user.
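
A very rough sketch of that idea follows; it is not smart_open's actual internals, and the helper name and signature are purely illustrative:

import gzip

def wrap_if_transcoded(raw_stream, blob, path):
    # Hypothetical helper: `raw_stream` is assumed to be a file-like object
    # yielding the object's *stored* (compressed) bytes, e.g. obtained with
    # raw_download=True, so its length agrees with Blob.size.
    if blob.content_encoding == 'gzip' and not path.endswith('.gz'):
        # The object is transcoded server-side; since we fetched the raw
        # bytes ourselves, decompress transparently as the caller reads.
        return gzip.GzipFile(fileobj=raw_stream, mode='rb')
    return raw_stream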
