GCS decompressive transcoding not supported when reading
Problem description
Using smart_open (unreleased version with GCS support) to download files from GCS with transparent decompressive transcoding enabled may lead to incomplete files being downloaded, depending on the compressed file size.
With Google Cloud Storage there is the option to store gzip-compressed files and use decompressive transcoding to transparently decompress them when downloading. Decompression is then handled by Google's servers. In this case the filename wouldn't have any compression extension (e.g. file.csv), but inspecting its metadata would show something like this:
{
"Content-Type": "text/csv; charset=utf-8",
"Content-Encoding": "gzip"
}
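These fields can be checked with the google-cloud-storage client (a minimal sketch; the bucket and object names are placeholders):
from google.cloud import storage

blob = storage.Client().bucket('my-bucket').get_blob('file.csv')

# For a transcoded object, the name carries no compression extension,
# but the metadata reveals the stored encoding.
print(blob.content_type)      # e.g. 'text/csv; charset=utf-8'
print(blob.content_encoding)  # 'gzip'
print(blob.size)              # the stored (compressed) size -- see below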
This would be fine if it weren't for the fact that in such cases Blob()._size returns the compressed size. Since smart_open uses this value to decide when to stop reading, the result is an incomplete file.
Steps/code to reproduce the problem
write a ~400 KB file (larger than smart_open's default buffer size)
$ cat /dev/urandom | gtr -dc A-Za-z0-9 | head -c 400000 > rand.txt
$ ls -l
total 1024
-rw-r--r-- 1 gustavomachado staff 524288 Feb 17 12:34 rand.txt
upload file to GCS
$ gsutil cp -Z ./rand.txt gs://my-bucket/
Copying file://./rand.txt [Content-Type=text/plain]...
- [1 files][293.8 KiB/293.8 KiB]
Operation completed over 1 objects/293.8 KiB.
resulting (compressed) file is 293.8 KiB.
check file metadata
$ gsutil stat gs://my-bucket/rand.txt
gs://my-bucket/rand.txt:
Creation time: Mon, 17 Feb 2020 13:45:36 GMT
Update time: Mon, 17 Feb 2020 13:45:36 GMT
Storage class: MULTI_REGIONAL
Cache-Control: no-transform
Content-Encoding: gzip
Content-Language: en
Content-Length: 300842
Content-Type: text/plain
Hash (crc32c): Ko+ooA==
Hash (md5): 8C6OlwZIR+fgRMy2xmQqLw==
ETag: CNWW+Kjc2OcCEAE=
Generation: 1581947136379733
Metageneration: 1
download file using smart_open (gcloud credentials already set)
>>> from smart_open import open
>>> with open('gs://my-bucket/rand.txt', 'r') as fin:
... with open('downloaded.txt', 'w') as fout:
... for line in fin:
... fout.write(line)
...
348550
check resulting file size
$ ls -l
total 1472
-rw-r--r-- 1 gustavomachado staff 348550 Feb 17 14:48 downloaded.txt
-rw-r--r-- 1 gustavomachado staff 400000 Feb 17 14:45 rand.txt
original file is 400 KB, however the downloaded file is 348 KB. Not sure why it's still bigger than the 300842 bytes reported by Google, though.
Versions
Please provide the output of:
>>> import platform, sys, smart_open
>>> print(platform.platform())
Darwin-18.7.0-x86_64-i386-64bit
>>> print("Python", sys.version)
Python 3.7.2 (default, Dec 9 2019, 14:10:57)
[Clang 10.0.1 (clang-1001.0.46.4)]
>>> print("smart_open", smart_open.__version__)
smart_open 1.9.0
smart_open has been pinned to 72818ca, installed with
$ pip install git+git://github.com/RaRe-Technologies/smart_open.git@72818ca6d3a0a99e1717ab31db72bf109ac5ce65
Checklist
Before you create the issue, please make sure you have:
- Described the problem clearly
- Provided a minimal reproducible example, including any required data
- Provided the version numbers of the relevant software
Possible solutions
Setting buffer_size to a value larger than the compressed file size will of course download the file in its entirety, but for large files that would mean loading the entire file into memory.
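For example (a sketch of that workaround; the bucket name is a placeholder, and the buffer must exceed the 300842-byte compressed size):
from smart_open import open

# A buffer larger than the compressed object means a single download
# request, sidestepping the truncated ranged reads.
with open('gs://my-bucket/rand.txt', 'r',
          transport_params=dict(buffer_size=512 * 1024)) as fin:
    data = fin.read()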
A reasonable option would be to check Blob().content_encoding, and if it is equal to 'gzip', call Blob().download_as_string with raw_download=True, and then handle decompression internally with the already-existing decompression mechanisms.
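As a stopgap, the same idea works outside smart_open (a sketch; bucket and object names are placeholders):
from google.cloud import storage
import gzip

blob = storage.Client().bucket('my-bucket').get_blob('rand.txt')

if blob.content_encoding == 'gzip':
    # raw_download=True fetches the stored (compressed) bytes, whose
    # length matches blob.size, so nothing is truncated.
    data = gzip.decompress(blob.download_as_string(raw_download=True))
else:
    data = blob.download_as_string()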
If the maintainers agree this would be a viable solution, I’ll be happy to provide a PR implementing it.
I can confirm this problem, though I noticed it in a different manner. I tested with a file similar to the above, but used transport_params=dict(buffer_size=1024) to force it to stream in parts. I have 4 cases to share with you, testing GCS files named file.txt and file.txt.gz, with ignore_ext=True and ignore_ext=False. Data was uploaded to GCS.
- file.txt with ignore_ext=True, file.txt with ignore_ext=False, file.txt.gz with ignore_ext=True: reads the first buffered chunk OK; subsequent chunks throw "urllib3.exceptions.DecodeError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))"
- file.txt.gz with ignore_ext=False: raises "OSError: Not a gzipped file (b'FI')" immediately
I modified the smart_open code to add raw_download=True to the download_as_string() call, and here are the results:
- file.txt with ignore_ext=True, file.txt with ignore_ext=False, file.txt.gz with ignore_ext=True: the file streams from the open() object as gzipped binary. The size of the data matches the compressed size on GCS. I was able to decompress this data successfully using gzip.decompress(data), and the data matched exactly.
- file.txt.gz with ignore_ext=False: the file streams from the open() object as uncompressed data. The file size matches the raw size before upload, and the data matches the raw uncompressed data exactly.
Hope this helps y'all find a good solution. Included below is the quick and dirty script I used to demo this.
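A minimal sketch of such a script (the bucket name, object names, and buffer size here are assumptions, not the original script):
from smart_open import open

PARAMS = dict(buffer_size=1024)  # small buffer forces multi-chunk streaming

# The four cases described above: both objects, with and without
# extension-based decompression.
for key in ('file.txt', 'file.txt.gz'):
    for ignore in (True, False):
        try:
            with open('gs://my-bucket/' + key, 'rb',
                      ignore_ext=ignore, transport_params=PARAMS) as fin:
                data = fin.read()
            print(key, 'ignore_ext=%s' % ignore, len(data))
        except Exception as exc:
            print(key, 'ignore_ext=%s' % ignore, type(exc).__name__, exc)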
OK, @gdmachado's suggestion is probably what we want to do then. If Blob.content_encoding == 'gzip' and file_extension != '.gz', then we save state so smart_open knows the file was transcoded to gzip, and download the raw compressed data, whose size will be true to Blob.size. I think smart_open.gcs.SeekableBufferedInputBase.read will have to use this state to know to decompress the data before it is returned to the user.
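A rough sketch of that decompression step (the function name is hypothetical; the actual change would live inside smart_open.gcs):
import zlib

def _gunzip_chunks(chunks):
    # Streaming gzip decode: 16 + MAX_WBITS tells zlib to expect a
    # gzip header and trailer, so large transcoded objects never need
    # to sit in memory all at once.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for raw in chunks:
        out = decoder.decompress(raw)
        if out:
            yield out
    tail = decoder.flush()
    if tail:
        yield tail
One wrinkle: seek offsets on a transcoded stream would refer to compressed bytes, so seeking would need to be restricted or reworked for this case.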