How to set index and divisions lazily using DataFrame.read_csv?
The DataFrame.read_csv method currently does not accept an index_col argument. Instead, the error message suggests using dd.read_csv(...).set_index('my-index'). However, this is an expensive computation that should not be necessary if the index is already sorted and the divisions are known.
Is there a way to explicitly set the index and divisions through the read_csv method without loading the data?
Issue Analytics
- State:
- Created 7 years ago
- Comments:9 (4 by maintainers)
Top GitHub Comments
Perfect! However, I want to set not only the divisions but the index as well. For quite a while now I have been struggling with the very basic task of saving a dask dataframe to disk in parallel and reloading it lazily.
I therefore figured that others might be in the same situation and would like to pass as much metadata as possible to the read_csv function, so I wanted to suggest extending the API of this method so that known divisions along a specified index column can be provided.
If your column is sorted then you can use df.set_index(column, sorted=True). This will still involve a full pass through the data to find locations, but will not require an on-disk shuffle.
You might also consider moving to a format like Parquet, which allows for index information to be written as metadata. This is in master in dd.io.parquet but not yet released. http://fastparquet.readthedocs.io/en/latest/details.html#connection-to-dask