Documentation for `set_index(col, compute=True)` is unclear/inaccurate

I think the documentation is currently unclear/inaccurate about the nature of the compute parameter for set_index:

df.set_index(col, compute=True)

The documentation currently contains this description:

compute: bool, default False

  • Whether or not to trigger an immediate computation. Defaults to False. Note, that even if you set compute=False, an immediate computation will still be triggered if divisions is None.

This suggests that if I provide divisions and set compute=True, an immediate computation will be triggered. However, that only seems to happen when shuffle='disk' is also passed. Even then, it’s not clear to me what is actually being computed.

Examples from the SO question where I originally asked about this (What does set_index(col, compute=True) do in Dask?):

import dask.datasets
df = dask.datasets.timeseries()

# Nothing gets submitted to the scheduler
df.set_index(
    'name', 
    divisions=('Alice', 'Michael', 'Zelda'), 
    compute=True
) 

Going down the stack of functions that set_index calls, it appears that the only place where compute is actually used is in rearrange_by_column_disk. And indeed:

# Still, nothing gets submitted
df.set_index(
    'name', 
    divisions=('Alice', 'Michael', 'Zelda'), 
    shuffle='tasks',
    compute=True
) 

# Something is computed here
df.set_index(
    'name', 
    divisions=('Alice', 'Michael', 'Zelda'), 
    shuffle='disk',
    compute=True
) 

If I’m correct, then I believe the documentation should state that this setting only affects the shuffle='disk' case. Also, I can’t work out from the documentation what is actually being computed: “immediate computation” of what?
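
To make the observed behaviour concrete, here is a minimal toy model of the control flow as I understand it from the examples above. It is a sketch under stated assumptions, not Dask's code: ToyFrame, eager_work_for and the string labels are hypothetical names invented for illustration, and the "eager disk step" label is a placeholder, since what the disk path actually computes is exactly the open question.

```python
# Toy model only. ToyFrame and its methods are hypothetical names,
# not Dask's implementation or API.
class ToyFrame:
    def __init__(self):
        # Labels for any work that ran eagerly, in order.
        self.eager_work = []

    def set_index(self, col, divisions=None, shuffle="tasks", compute=False):
        if divisions is None:
            # Per the documented note: unknown divisions force immediate
            # work, regardless of the compute flag.
            self.eager_work.append("infer divisions")
        if shuffle == "disk":
            return self._rearrange_disk(col, compute)
        return self._rearrange_tasks(col)

    def _rearrange_tasks(self, col):
        # This path never receives or inspects compute.
        return ("lazy", col)

    def _rearrange_disk(self, col, compute):
        if compute:
            # Placeholder label: what exactly runs here is the open question.
            self.eager_work.append("eager disk step")
        return ("lazy", col)


def eager_work_for(**kwargs):
    """Run set_index on a fresh ToyFrame and report what ran eagerly."""
    frame = ToyFrame()
    frame.set_index("name", **kwargs)
    return frame.eager_work


divs = ("Alice", "Michael", "Zelda")
print(eager_work_for(divisions=divs, compute=True))                   # []
print(eager_work_for(divisions=divs, shuffle="tasks", compute=True))  # []
print(eager_work_for(divisions=divs, shuffle="disk", compute=True))   # ['eager disk step']
```

In this model, compute=True has an effect only on the disk path, which matches the three calls above. A description of the parameter that matched this flow would need to mention the shuffle method explicitly.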

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 13 (7 by maintainers)

Top GitHub Comments

scharlottej13 commented, Nov 23, 2021

Thanks for raising this issue and for the examples. I was able to reproduce what you’re describing and was actually about to open an issue after reading your SO post. @ian-r-rose and I had a good discussion on this, and it does seem that at the very least the documentation should be updated to reflect this behavior, but it’s possible this could be a bug. @gjoseph92 maybe you have thoughts on this?

jsignell commented, Dec 15, 2021

Thanks for the ping. I am all for deprecating surprising behaviors and eager computations. @DahnJ did you open an issue for adding a disk option to persist? That does seem like an interesting idea. I wonder whether the workers could simply pick up the pieces of data that they know about, even if they don’t share a disk.
