
Automatic blocksize for dask.dataframe.read_csv

See original GitHub issue

We could automatically tune the default blocksize based on total physical or available memory and the number of cores. This would avoid memory errors on smaller machines and for novice users.

This would probably depend on the optional presence of psutil:

psutil.virtual_memory()
psutil.cpu_count()
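
For a sense of what those calls report, a minimal illustration (values are machine-dependent; the numbers in the comments are examples only):

import psutil

psutil.virtual_memory().total      # total physical RAM in bytes, e.g. 16690651136
psutil.virtual_memory().available  # RAM currently available, e.g. 11170242560
psutil.cpu_count()                 # logical core count, e.g. 8 (may return None)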

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Jun 18, 2016

That, probably divided by a further factor of 10 or so. Also, I can’t imagine a case where we would want a blocksize greater than 64MB.

We want to avoid filling ram. It’s ok to be way less than total memory (factor of ten is fine) and it’s terrible to be greater than total memory (things run 100x slower). A block of data on disk might very easily expand to 5x its original size when converted from CSV to a dataframe. We will have about as many blocks in ram as we have cores.
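
Putting that reasoning into code, a minimal sketch of the proposed heuristic; the function name and the exact constants are illustrative assumptions drawn from the comment above, not Dask's actual implementation:

import psutil

def suggest_blocksize(max_blocksize=64 * 2**20):
    """Hypothetical default-blocksize heuristic based on the comment above."""
    memory = psutil.virtual_memory().total  # total physical RAM in bytes
    cores = psutil.cpu_count() or 1         # cpu_count() may return None
    # Roughly one block per core sits in RAM at once, and a CSV block can
    # expand ~5x when parsed into a dataframe, so target ~1/10 of the
    # per-core share of memory ...
    blocksize = memory // cores // 10
    # ... and never exceed 64 MB regardless of machine size.
    return min(blocksize, max_blocksize)

On a 16 GB, 8-core machine this hits the 64 MB cap; a 4 GB, 8-core laptop would get blocks of roughly 50 MB.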

0 reactions
dukebody commented, Jul 10, 2016

This is fixed by #1328, let’s close. 😃

Read more comments on GitHub.

Top Results From Across the Web

Reading CSV files into Dask DataFrames with read_csv
Dask read_csv: blocksize. The number of partitions depends on the value of the blocksize argument. If you don't supply a value to...
dask.dataframe.read_csv
Read CSV files into a Dask.DataFrame. This parallelizes the pandas.read_csv() function in the following ways: It supports loading many files at once using …
Python dask.dataframe.read_csv() Examples
This page shows Python examples of dask.dataframe.read_csv. ... self.df = dd.read_csv(file_path, sep=sep, header=header, blocksize=block_size) ...
How to read a compressed (gz) CSV file into a dask ...
csv file that is compressed via gz into a dask dataframe? I've tried it directly with import dask.dataframe as dd df = dd.read_csv("Data....
csv.py
{reader}('largefile.csv', blocksize=25e6)  # 25MB chunks  # doctest: +SKIP
3. You can read CSV files from external resources (e.g. S3, HDFS) providing a URL: …
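
Tying the snippets above together, a short usage sketch (file names are placeholders; blocksize=25e6 mirrors the docstring excerpt above):

import dask.dataframe as dd

# Each ~25 MB chunk of the file becomes one partition of the DataFrame.
df = dd.read_csv('largefile.csv', blocksize=25e6)
print(df.npartitions)  # roughly ceil(file_size / blocksize)

# Glob patterns load many files at once, one or more partitions per file.
many = dd.read_csv('data/2016-*.csv')

# Gzip is not a splittable format, so a gzipped CSV cannot be chunked; pass
# blocksize=None to read each file as a single partition.
gz = dd.read_csv('largefile.csv.gz', compression='gzip', blocksize=None)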
