Automatic blocksize for dask.dataframe.read_csv
We could automatically tune the default blocksize based on total physical or available memory and the number of cores. This would avoid memory errors on smaller machines and for novice users.
This would probably depend on the optional presence of psutil:
psutil.virtual_memory()
psutil.cpu_count()
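As a rough illustration (not a concrete proposal), those two calls could feed a machine-aware value straight into read_csv, which already accepts blocksize in bytes. The glob path and the divisor of 10 below are placeholders:

```python
import psutil
import dask.dataframe as dd

mem = psutil.virtual_memory()     # namedtuple with .total, .available, ... (bytes)
ncores = psutil.cpu_count() or 1  # logical core count; can return None

# Hypothetical: derive a blocksize from machine resources instead of the fixed default.
blocksize = mem.total // (ncores * 10)
df = dd.read_csv("data/*.csv", blocksize=blocksize)  # path is illustrative
```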
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
That, probably divided by a further factor of 10 or so. Also, I can’t imagine a case where we would want a blocksize greater than 64MB.
We want to avoid filling RAM. It's OK to be well below total memory (a factor of ten is fine), and it's terrible to be greater than total memory (things run 100x slower). A block of data on disk might very easily expand to 5x its original size when converted from CSV to a dataframe. We will have about as many blocks in RAM as we have cores.
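Putting those numbers together, a rough sketch of such a heuristic (the function name and constants are illustrative, not dask's actual implementation):

```python
import psutil

def auto_blocksize():
    """Pick a read_csv blocksize from machine resources.

    Roughly one block per core is in memory at a time, a CSV block can
    expand ~5x when parsed into a DataFrame, and we leave a further safety
    margin, so divide total memory by (cores * 10) and cap at 64 MB.
    """
    total = psutil.virtual_memory().total
    ncores = psutil.cpu_count() or 1
    return min(int(64e6), total // (ncores * 10))
```

For example, with 16 GB of RAM and 8 cores the division gives roughly 200 MB, which the cap then pulls down to 64 MB.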
This is fixed by #1328, let's close. 😃