
Compression not handled for GCS files


Problem description

When reading a gzip-encoded blob from GCS, the returned file object yields the compressed binary data instead of decompressed text.

Steps/code to reproduce the problem

In [30]: smart_open.open('gs://test/file.json.gz').read()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-30-4e7f4d55e81d> in <module>
----> 1 smart_open.open('gs://test/file.json.gz').read()

/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py in read(self, size, chars, firstline)
    502                 break
    503             try:
--> 504                 newchars, decodedbytes = self.decode(data, self.errors)
    505             except UnicodeDecodeError as exc:
    506                 if firstline:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
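The same UnicodeDecodeError can be reproduced locally without GCS or smart_open: every gzip stream starts with the magic bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte, which is exactly what the traceback reports at position 1. A minimal stdlib-only sketch:

```python
import gzip

# gzip-compress a small JSON payload, as it would be stored in the bucket
compressed = gzip.compress(b'{"key": "value"}')

# The gzip magic number is 0x1f 0x8b -- byte at position 1 is 0x8b
assert compressed[0] == 0x1F and compressed[1] == 0x8B

# Decoding the raw compressed bytes as UTF-8 fails just like in the issue
try:
    compressed.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0x8b in position 1: ...
```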

Also, I was able to find the root cause: the gcs.Reader class does not implement the .name attribute, which causes the compression.compression_wrapper function to skip decompression (because os.path.splitext(file_obj.name) returns ('unknown', '')).
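To illustrate the root cause and a possible interim workaround: with no usable filename, extension-based sniffing finds an empty extension, so no decompression wrapper is applied. Until the fix is released, one option is to read the blob in binary mode and wrap the stream in gzip.GzipFile yourself. A rough sketch, with io.BytesIO standing in for the GCS reader (the 'unknown' placeholder mirrors the fallback name described above):

```python
import gzip
import io
import os

# With a placeholder name, extension sniffing yields an empty extension,
# so no decompression wrapper would be selected.
assert os.path.splitext("unknown") == ("unknown", "")

# Workaround: wrap the binary stream in gzip.GzipFile manually.
# io.BytesIO stands in here for a binary-mode GCS file object.
raw = io.BytesIO(gzip.compress(b'{"key": "value"}'))
with gzip.GzipFile(fileobj=raw) as fin:
    text = fin.read().decode("utf-8")
print(text)  # {"key": "value"}
```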

Versions

Darwin-19.3.0-x86_64-i386-64bit
Python 3.7.7 (default, Mar 10 2020, 15:43:33) [Clang 11.0.0 (clang-1100.0.33.17)]
smart_open 2.0.0

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions — mpenkov commented, Jul 1, 2020
1 reaction — mpenkov commented, Jun 28, 2020

Hi @gelioz, thank you for reporting this.

We have already fixed the issue in the develop branch (see https://github.com/RaRe-Technologies/smart_open/pull/506). Are you able to confirm?


Top Results From Across the Web

Transcoding of gzip-compressed files | Cloud Storage
This page discusses the conversion of files to and from a gzip-compressed state. The page includes an overview of transcoding, best practices for...

Compress files saved in Google cloud storage - Stack Overflow
The files are created and populated by Google dataflow code. Dataflow cannot write to a compressed file but my requirement is to save...

Compression of data during transfer to Google Cloud Coldline ...
gsutil rsync does not support compression.

Serverless Compression of GCS Data With Streaming Golang ...
GZIPing of large files 1GB+ seems to require more grunt to be able to complete compression within 9-minute slot. The solution is simple,...

Handling Zip Files in GCS (Conflict) : r/dataengineering - Reddit
Yes, but data transfers don't support zip compression and handle gzip instead. That team was using cloud runs to read the file and...
