Cannot do chunked reads on compressed files. To read, set blocksize=None
I tried to read data from a compressed file into a dask bag as follows:
bag = db.read_text('[path/to/gzip/file/data.gz]', blocksize="2.5GB").str.strip().str.split('\t')
but I got the error:
Cannot do chunked reads on compressed files. To read, set blocksize=None
If I change blocksize to None as suggested, the bag has only 1 partition, which is sent to a single worker for all further computation no matter how many workers are available.
I also tried bag.repartition(npartitions=6), but all the work was still assigned to 1 worker (even though 2 workers were available).
How can I read data from one gzip file and get multiple partitions that can be assigned to multiple workers for computation?
Issue Analytics
- State:
- Created 2 years ago
- Comments: 9 (5 by maintainers)

Technically you could probably use lzma / xz files. But that’s somewhat exotic. Pile-of-gzip for the win.
On Tue, Jun 8, 2021 at 2:34 PM Fred Li @.***> wrote:
Yes, the gzip file contains TSV-format files that cannot be changed.
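
For anyone landing here later, the "pile-of-gzip" suggestion could look roughly like the sketch below: stream the single gzip file once and rewrite it as several smaller gzip files, so that each file can become its own partition. This is only an illustration, not something posted in the thread. The helper name `split_gzip`, the `part-00000.gz` naming scheme, the `lines_per_chunk` default and the UTF-8 encoding are all assumptions.

```python
import gzip
import os


def split_gzip(src_path, out_dir, lines_per_chunk=1_000_000, encoding="utf-8"):
    """Re-split one gzip text file into several smaller gzip files.

    Hypothetical helper: streams line by line, so the whole file never
    has to fit in memory. Returns the list of chunk paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None
    # newline="" keeps the original line endings untouched on read and write.
    with gzip.open(src_path, "rt", encoding=encoding, newline="") as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                # Start a new chunk file every `lines_per_chunk` lines.
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part-{len(chunk_paths):05d}.gz")
                out = gzip.open(path, "wt", encoding=encoding, newline="")
                chunk_paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

The resulting directory can then be read with a glob, for example `db.read_text('chunks/*.gz')` (the `chunks/` path is a placeholder), leaving `blocksize` at its default of None. Dask then creates one partition per file, so the number of partitions follows the number of chunk files rather than a blocksize.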