Documentation for `set_index(col, compute=True)` is unclear/inaccurate

I think the documentation is currently unclear/inaccurate about the nature of the compute parameter for set_index:

df.set_index(col, compute=True)

The documentation currently contains this description:

compute: bool, default False

Whether or not to trigger an immediate computation. Defaults to False. Note, that even if you set compute=False, an immediate computation will still be triggered if divisions is None.

This would suggest that if I provide divisions and set compute=True, immediate computation will be triggered. However, this only seems to be the case when using shuffle='disk'. Even then, it’s not clear to me what is actually being computed.

Examples from the SO question where I originally asked about this (What does set_index(col, compute=True) do in Dask?):

import dask.datasets
df = dask.datasets.timeseries()

# Nothing gets submitted to the scheduler
df.set_index(
    'name', 
    divisions=('Alice', 'Michael', 'Zelda'), 
    compute=True
)

Going down the stack of functions that set_index actually calls, it appears that the only place where compute is actually used is in rearrange_by_column_disk. And indeed:

# Still, nothing gets submitted
df.set_index(
    'name', 
    divisions=('Alice', 'Michael', 'Zelda'), 
    shuffle='tasks',
    compute=True
) 

# Something is computed here
df.set_index(
    'name', 
    divisions=('Alice', 'Michael', 'Zelda'), 
    shuffle='disk',
    compute=True
)
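The pattern these three calls suggest can be pictured with a small toy model. This is only a sketch: every name in it (`set_index_sketch`, `rearrange_tasks_sketch`, `rearrange_disk_sketch`, `partition_to_disk`) is hypothetical, none of it is Dask code, and the idea that the eager step is "write the partitions to disk" is an assumption rather than something the documentation states.

```python
# Toy model only: every name here is hypothetical and none of this is Dask code.

executed = []  # records work that ran eagerly, while the "graph" was being built


def partition_to_disk(part):
    """Stand-in for a step that splits one partition and writes it out."""
    executed.append(part)
    return f"on-disk:{part}"


def rearrange_tasks_sketch(parts, compute=False):
    # `compute` is accepted but never consulted, so the result is always lazy.
    return [lambda p=p: f"shuffled:{p}" for p in parts]


def rearrange_disk_sketch(parts, compute=False):
    if compute:
        # Eager branch (assumed): the partitioning step runs right now, and
        # the returned tasks only hand back the already-written results.
        written = [partition_to_disk(p) for p in parts]
        return [lambda w=w: w for w in written]
    # Lazy branch: partitioning is deferred until a task is actually called.
    return [lambda p=p: partition_to_disk(p) for p in parts]


def set_index_sketch(parts, shuffle="tasks", compute=False):
    # The flag is passed down either way, but only one branch looks at it.
    rearrange = rearrange_disk_sketch if shuffle == "disk" else rearrange_tasks_sketch
    return rearrange(parts, compute=compute)


parts = ["p0", "p1"]

set_index_sketch(parts, shuffle="tasks", compute=True)
ran_with_tasks = list(executed)  # snapshot after the 'tasks' call

set_index_sketch(parts, shuffle="disk", compute=True)
ran_with_disk = list(executed)   # snapshot after the 'disk' call
```

In this toy, `ran_with_tasks` stays empty while `ran_with_disk` holds both partitions, which mirrors the observation that compute=True has a visible effect only on the disk path.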

If I’m correct, then I believe the documentation should reflect the fact that this setting only affects the shuffle='disk' case. Also, I can’t work out from the documentation what is actually being computed: “immediate computation” of what?

Top GitHub Comments

scharlottej13 commented, Nov 23, 2021

Thanks for raising this issue and for the examples. I was able to reproduce what you’re describing and was actually about to open an issue after reading your SO post. @ian-r-rose and I had a good discussion on this. At the very least, the documentation should be updated to reflect this behavior, but it’s possible this is a bug. @gjoseph92 maybe you have thoughts on this?

jsignell commented, Dec 15, 2021

Thanks for the ping. I am all for deprecating surprising behaviors and eager computations. @DahnJ did you open an issue for adding a disk option to persist? That does seem like an interesting idea. I wonder whether the workers could just pick up the pieces of data that they know about, even if they don’t share a disk.