Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Cannot do chunked reads on compressed files. To read, set blocksize=None

See original GitHub issue

I tried to read data from compressed file to dask bag as follows:

bag = db.read_text('[path/to/gzip/file/data.gz]', blocksize="2.5GB").str.strip().str.split('\t')

but I got error: Cannot do chunked reads on compressed files. To read, set blocksize=None

If I changed blocksize to None as suggested, the bag had only 1 partition, which was sent to a single worker for further computation no matter how many workers were available.

I also tried bag.repartition(npartitions=6), but all the work was still assigned to a single worker (even though 2 workers were available).

How can I read data from a single gzip file into multiple partitions that can be assigned to multiple workers for computation?
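For reference, the behaviour can be reproduced on a tiny file: asking for a blocksize on a gzip file raises the error at graph-construction time, and blocksize=None yields exactly one partition. A minimal sketch (the file path and contents here are made up for illustration):

```python
import gzip
import os
import tempfile

import dask.bag as db

# Build a tiny gzip TSV file to stand in for the real data.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "data.gz")
with gzip.open(path, "wt") as f:
    f.write("a\t1\nb\t2\n")

# Requesting chunked reads on a compressed file fails immediately:
try:
    db.read_text(path, blocksize="2.5GB")
except ValueError as e:
    print(e)  # Cannot do chunked reads on compressed files. ...

# With blocksize=None the whole file becomes a single partition:
bag = db.read_text(path, blocksize=None).str.strip().str.split("\t")
print(bag.npartitions)  # 1
```

Because gzip streams have no internal index, a reader cannot seek to byte offset N and start decompressing there, which is why the whole file must be read as one block.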

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Jun 8, 2021

Technically you could probably use lzma / xz files. But that’s somewhat exotic. Pile-of-gzip for the win.

On Tue, Jun 8, 2021 at 2:34 PM Fred Li @.***> wrote:

yes, the gzip file contains tsv format files that cannot be changed.

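The "pile of gzip" suggestion can be sketched as follows: split the decompressed lines across several independently compressed files, then read them back with a glob so Dask creates one partition per file. The file names and partition count below are made up for illustration:

```python
import gzip
import os
import tempfile

import dask.bag as db

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "data.gz")

# Stand-in for the original single large gzip TSV file.
with gzip.open(src, "wt") as f:
    for i in range(100):
        f.write(f"{i}\tvalue-{i}\n")

# Re-compress the lines as a "pile of gzip": several smaller files.
with gzip.open(src, "rt") as f:
    lines = f.readlines()
nparts = 4
per_file = -(-len(lines) // nparts)  # ceiling division
for i in range(nparts):
    part = os.path.join(tmp, f"part-{i:03d}.tsv.gz")
    with gzip.open(part, "wt") as out:
        out.writelines(lines[i * per_file:(i + 1) * per_file])

# One partition per file, so the work can spread across workers.
bag = db.read_text(os.path.join(tmp, "part-*.tsv.gz")).str.strip().str.split("\t")
print(bag.npartitions)  # 4
```

This also parallelizes decompression, since each worker opens its own file; the lzma/xz route mentioned above can in principle support block-level reads, but as noted it is exotic.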

0 reactions
fredms commented, Jun 8, 2021

yes, the gzip file contains tsv format files that cannot be changed.


Top Results From Across the Web

Can dask read compressed files in blocks? - Stack Overflow
Is Dask able to read compressed files in chunks? I receive a couple of errors in this notebook when reading a .xz file...

Issue when reading remote CSV #6046 - dask/dask - GitHub
Backing filesystem couldn't determine file size, cannot do chunked reads. To read, set blocksize=None. But using blocksize=None in Dask it's ...

dask.bytes.core - Dask documentation
... else: comp = compression if comp is not None: raise ValueError( "Cannot do chunked reads on compressed files. " "To read, set...

How to read ZIP files in R - Roel Peters
Reading .zip or tar.gz file can improve the speed of your workflow in ... When data sets are ping-ponged across an organization, in...

Reading CSV files into Dask DataFrames with read_csv
Here's how to read the CSV file into a Dask DataFrame. import dask.dataframe as dd ddf = dd.read_csv("dogs.csv"). You can inspect the content...
