
How to set index and divisions lazily using DataFrame.read_csv?

See original GitHub issue

The DataFrame.read_csv method currently does not accept an index_col argument. Instead, the error message suggests using dd.read_csv(...).set_index('my-index'). However, this is an expensive computation, which should not be necessary if the index is already sorted and the divisions are known.

Is there a way to explicitly set the index and divisions via the read_csv method without loading the data?
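
For reference, the workaround the error message points to looks like the sketch below (the file pattern is hypothetical; 'my-index' is the column name from the error text). The set_index step is what triggers the expensive computation:

```python
import dask.dataframe as dd

# The workaround the error message suggests: set the index after reading.
# This computes the divisions from the data, which is exactly the expense
# the question is trying to avoid.
df = dd.read_csv("data-*.csv").set_index("my-index")
```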

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
abast commented, Dec 14, 2016

Perfect! However, I want to set not only the divisions but the index as well. Overall, I have been struggling for quite a while now with the very basic task of saving a Dask dataframe to disk in parallel and reloading it lazily:

  • setting an index after read_csv is expensive (even with sorted=True)
  • HDF apparently does not support very wide dataframes
  • the to_hdf method requires some special attention to be able to save in parallel (see the sketch below)
  • fastparquet looks like it will be a superb solution, but currently carries the disclaimer of not being battle-tested

I therefore figured others might be in the same situation of wanting to provide as much metadata as possible to the read_csv function, so I wanted to suggest extending this method's API so that known divisions along a specified index column can be passed in.
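
On the to_hdf point, a minimal sketch under the assumption that pytables is installed (file names hypothetical): a globstring path writes one HDF5 file per partition, which is the usual route to parallel saves.

```python
import pandas as pd
import dask.dataframe as dd

# Toy frame standing in for real data.
pdf = pd.DataFrame({"x": range(8)})
df = dd.from_pandas(pdf, npartitions=2)

# A '*' in the path writes one HDF5 file per partition, so partitions
# are saved in parallel instead of queuing on a single file.
df.to_hdf("data-*.h5", "/data")

# Reload lazily from the same pattern.
df2 = dd.read_hdf("data-*.h5", "/data")
```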

1 reaction
mrocklin commented, Dec 14, 2016

If your column is sorted then you can use df.set_index(column, sorted=True). This will still involve a full pass through the data to find locations, but will not require an on-disk shuffle.
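
A minimal sketch of both variants on toy data; dask's set_index also accepts an explicit divisions= argument, which should skip even the pass that locates partition boundaries (the boundary values here are illustrative):

```python
import pandas as pd
import dask.dataframe as dd

# Toy data: column "t" is sorted across partitions.
pdf = pd.DataFrame({"t": [0, 1, 2, 3, 4, 5], "x": range(6)})
df = dd.from_pandas(pdf, npartitions=2)

# One full pass over the data to locate partition boundaries, no shuffle:
df1 = df.set_index("t", sorted=True)

# If the divisions are already known, supply them to skip that pass too:
df2 = df.set_index("t", sorted=True, divisions=[0, 3, 5])
print(df2.divisions)  # (0, 3, 5)
```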

You might also consider moving to a format like Parquet, which allows for index information to be written as metadata. This is in master in dd.io.parquet but not yet released. http://fastparquet.readthedocs.io/en/latest/details.html#connection-to-dask
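
In today's Dask the Parquet path has long since been released as dd.to_parquet/dd.read_parquet; a rough sketch follows (parameter names have shifted across versions, so treat the calculate_divisions flag as version-dependent):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(6)}, index=pd.Index(range(6), name="t"))
df = dd.from_pandas(pdf, npartitions=2)  # sorted index, known divisions

df.to_parquet("out.parquet")  # the index is written into file metadata

# Divisions come back from footer statistics rather than a pass over data.
df2 = dd.read_parquet("out.parquet", calculate_divisions=True)
assert df2.known_divisions
```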


Top Results From Across the Web

pandas.read_csv — pandas 1.5.2 documentation
To parse an index or column with a mixture of timezones, specify date_parser to be a partially-applied pandas.to_datetime() with utc=True. See Parsing...
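
A sketch of the pattern that snippet describes (hypothetical file and column names; date_parser was deprecated in pandas 2.0):

```python
from functools import partial
import pandas as pd

# Partially apply to_datetime with utc=True to parse mixed timezones.
df = pd.read_csv(
    "events.csv",
    parse_dates=["ts"],
    date_parser=partial(pd.to_datetime, utc=True),
)
```
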
Process dask dataframe by chunks of rows - Stack Overflow
You can repartition the dataframe along a division which defines how index values should be allocated across partitions (assuming unique ...
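
Illustrating that answer with toy boundary values (repartitioning by divisions requires the current divisions to be known):

```python
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=1)

# Re-chunk along explicit index boundaries: four partitions covering 0-99.
df = df.repartition(divisions=[0, 25, 50, 75, 99])
```
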
Dask Dataframes — Python tools for Big data - Pierre Navaro
Divisions and the Index ... The Pandas index associates a value to each record/row of your data. Operations that align with the index,...
Dask DataFrame - parallelized pandas
If divisions are not known (for instance if the index is not sorted) then you will get None as the division. The “Distance”...
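
A small sketch of checking for this (clear_divisions here just simulates unknown divisions):

```python
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({"x": range(6)}), npartitions=2)
print(df.divisions)        # (0, 3, 5): boundaries are known

df = df.clear_divisions()  # discard the ordering information
print(df.divisions)        # (None, None, None)
print(df.known_divisions)  # False
```
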
Pandas read_csv() - How to read a csv file in Python
Set any column(s) as Index ... By default, Pandas adds an initial index to the data frame loaded from the CSV file. You...
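
And the plain-pandas counterpart of the original question, with hypothetical names:

```python
import pandas as pd

# Use the "id" column as the index instead of the default RangeIndex.
df = pd.read_csv("data.csv", index_col="id")
```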
