Automatic blocksize for dask.dataframe.read_csv
We could automatically tune the default blocksize based on total physical or available memory and the number of cores. This would avoid memory errors on smaller machines and for novice users.
This would probably depend on the optional presence of psutil:
psutil.virtual_memory()
psutil.cpu_count()
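As a rough illustration (not a concrete proposal), those two calls could feed a machine-aware value straight into read_csv, which already accepts blocksize in bytes. The glob path and the divisor of 10 below are placeholders:

```python
import psutil
import dask.dataframe as dd

mem = psutil.virtual_memory()     # namedtuple with .total, .available, ... (bytes)
ncores = psutil.cpu_count() or 1  # logical core count; can return None

# Hypothetical: derive a blocksize from machine resources instead of the fixed default.
blocksize = mem.total // (ncores * 10)
df = dd.read_csv("data/*.csv", blocksize=blocksize)  # path is illustrative
```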
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
That, probably divided by a further factor of 10 or so. Also, I can’t imagine a case where we would want a blocksize greater than 64MB.
We want to avoid filling RAM. It's OK to be well below total memory (a factor of ten is fine), and it's terrible to be greater than total memory (things run 100x slower). A block of data on disk might very easily expand to 5x its original size when converted from CSV to a dataframe. We will have about as many blocks in RAM as we have cores.
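Putting those numbers together, a rough sketch of such a heuristic (the function name and constants are illustrative, not dask's actual implementation):

```python
import psutil

def auto_blocksize():
    """Pick a read_csv blocksize from machine resources.

    Roughly one block per core is in memory at a time, a CSV block can
    expand ~5x when parsed into a DataFrame, and we leave a further safety
    margin, so divide total memory by (cores * 10) and cap at 64 MB.
    """
    total = psutil.virtual_memory().total
    ncores = psutil.cpu_count() or 1
    return min(int(64e6), total // (ncores * 10))
```

For example, with 16 GB of RAM and 8 cores the division gives roughly 200 MB, which the cap then pulls down to 64 MB.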
This is fixed by #1328, let's close. 😃