Compression not handled for GCS files
Problem description
When reading a gzip-encoded blob from GCS, the file object returns compressed binary data instead of decompressed text.
Steps/code to reproduce the problem
In [30]: smart_open.open('gs://test/file.json.gz').read()
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-30-4e7f4d55e81d> in <module>
----> 1 smart_open.open('gs://test/file.json.gz').read()
/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py in read(self, size, chars, firstline)
502 break
503 try:
--> 504 newchars, decodedbytes = self.decode(data, self.errors)
505 except UnicodeDecodeError as exc:
506 if firstline:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
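Until this is fixed, a possible workaround is to open the blob in binary mode so smart_open does not try to decode it as text, and decompress the gzip stream manually with the standard library. This is only a sketch, reusing the bucket/object from the snippet above:

import gzip
import json

import smart_open

# Read the raw gzip bytes, then decompress them ourselves.
with smart_open.open('gs://test/file.json.gz', 'rb') as raw:
    with gzip.GzipFile(fileobj=raw) as decompressed:
        data = json.loads(decompressed.read().decode('utf-8'))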
Also, I was able to find the root cause: the gcs.Reader class does not implement the .name attribute, which causes the compression.compression_wrapper function to skip decompression (because os.path.splitext(file_obj.name) returns ('unknown', '')).
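To illustrate the failure mode, here is a simplified sketch of extension-based dispatch (not the actual smart_open implementation): when the file object has no usable .name, splitext yields no extension and the raw gzip bytes are passed through untouched.

import gzip
import os

def compression_wrapper_sketch(file_obj, mode='rb'):
    # Pick a decompressor based on the file extension.
    # If file_obj lacks a real .name (falling back to 'unknown' here),
    # splitext returns ('unknown', '') and no decompression happens.
    _, ext = os.path.splitext(getattr(file_obj, 'name', 'unknown'))
    if ext == '.gz':
        return gzip.GzipFile(fileobj=file_obj, mode=mode)
    return file_obj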
Versions
Darwin-19.3.0-x86_64-i386-64bit
Python 3.7.7 (default, Mar 10 2020, 15:43:33) [Clang 11.0.0 (clang-1100.0.33.17)]
smart_open 2.0.0
https://pypi.org/project/smart-open/2.1.0/
Hi @gelioz, thank you for reporting this.
We have already fixed the issue in the develop branch (see https://github.com/RaRe-Technologies/smart_open/pull/506). Are you able to confirm?
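One way to confirm, assuming the same test bucket and object, would be to upgrade to a release containing the fix (for example the 2.1.0 release linked above) and re-run the original call, which should now return the decoded JSON text:

# e.g. after: pip install --upgrade smart_open
import smart_open

text = smart_open.open('gs://test/file.json.gz').read()
print(text[:100])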