How to set index and divisions lazily using DataFrame.read_csv?

The DataFrame.read_csv method currently does not accept an index_col argument; the error message suggests using dd.read_csv(...).set_index('my-index') instead. However, this is an expensive computation that should not be necessary if the index is already sorted and the divisions are known.

Is there a way to explicitly set the index and divisions using the read_csv method without having to load the data?

Issue Analytics

State:
Created 7 years ago
Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction

abast commented, Dec 14, 2016

Perfect! However, I want to set not only the divisions but the index as well. Overall, I have been struggling for quite a while with the very basic task of saving a dask dataframe to disk in parallel and reloading it lazily:

Setting an index after read_csv is expensive (even with sorted=True).
HDF apparently does not support very wide dataframes.
The to_hdf method requires special care to save in parallel.
fastparquet looks like the ideal solution, but it currently carries a disclaimer that it is not battle-tested.

I therefore figured that others might be in the same situation and would like to pass as much metadata as possible to the read_csv function. My suggestion is to extend the API of this method so that known divisions along a specified index column can be provided.

1 reaction

mrocklin commented, Dec 14, 2016

If your column is sorted then you can use df.set_index(column, sorted=True). This will still involve a full pass through the data to find locations, but will not require an on-disk shuffle.

You might also consider moving to a format like Parquet, which allows for index information to be written as metadata. This is in master in dd.io.parquet but not yet released. http://fastparquet.readthedocs.io/en/latest/details.html#connection-to-dask
