Documentation for `set_index(col, compute=True)` is unclear/inaccurate
I think the documentation is currently unclear/inaccurate about the nature of the compute parameter for set_index:
df.set_index(col, compute=True)
The documentation currently contains this description:
compute: bool, default False
- Whether or not to trigger an immediate computation. Defaults to False. Note, that even if you set compute=False, an immediate computation will still be triggered if divisions is None.
This would suggest that if I provide divisions and set compute=True, immediate computation will be triggered. This only seems to be the case when using shuffle=disk, however. Even then, it’s not clear to me what is actually being computed.
Examples from the SO question where I originally asked about this (What does set_index(col, compute=True) do in Dask?):
import dask.datasets
df = dask.datasets.timeseries()
# Nothing gets submitted to the scheduler
df.set_index(
    'name',
    divisions=('Alice', 'Michael', 'Zelda'),
    compute=True
)
Going down the stack of functions set_index actually calls, it appears that the only place where compute is actually used is in rearrange_by_column_disk. And indeed:
# Still, nothing gets submitted
df.set_index(
    'name',
    divisions=('Alice', 'Michael', 'Zelda'),
    shuffle='tasks',
    compute=True
)

# Something is computed here
df.set_index(
    'name',
    divisions=('Alice', 'Michael', 'Zelda'),
    shuffle='disk',
    compute=True
)
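To make the suspected behaviour concrete, here is a toy model in plain Python of a flag that is accepted at the top level but only read in one branch. This is only a sketch of the pattern described above, not Dask's actual code: LazyFrame, toy_set_index, toy_shuffle_tasks and toy_shuffle_disk are hypothetical names invented for illustration, and what the real disk path computes is not modelled here.

```python
# Toy model only: every name below is a hypothetical stand-in, not Dask's real code.
from dataclasses import dataclass, field


@dataclass
class LazyFrame:
    """Stand-in for a lazy collection: records pending steps, runs nothing."""
    steps: list = field(default_factory=list)
    executed: list = field(default_factory=list)

    def run(self):
        """Execute all pending steps (the 'immediate computation')."""
        self.executed.extend(self.steps)
        self.steps = []
        return self


def toy_shuffle_tasks(frame, col):
    # The task-based path never sees `compute`; it only extends the pending work.
    frame.steps.append(f"shuffle-tasks({col})")
    return frame


def toy_shuffle_disk(frame, col, compute):
    # The disk-based path is the only place where the flag is read.
    frame.steps.append(f"shuffle-disk({col})")
    if compute:
        frame.run()
    return frame


def toy_set_index(frame, col, shuffle="tasks", compute=False):
    # `compute` is accepted here but only forwarded to the disk branch.
    if shuffle == "disk":
        return toy_shuffle_disk(frame, col, compute)
    return toy_shuffle_tasks(frame, col)


lazy = toy_set_index(LazyFrame(), "name", shuffle="tasks", compute=True)
eager = toy_set_index(LazyFrame(), "name", shuffle="disk", compute=True)
print(lazy.executed)   # nothing ran, despite compute=True
print(eager.executed)  # the disk branch ran its pending step
```

In this toy model, compute=True is silently ignored on the tasks path, which mirrors what the examples above appear to show.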
If I’m correct, then I believe the documentation should reflect the fact that this setting only affects the shuffle=disk case. Also, I can’t work out from the documentation what is actually being computed: “immediate computation” of what?
Issue Analytics
- Created 2 years ago
- Comments: 13 (7 by maintainers)
Thanks for raising this issue and for the examples. I was able to reproduce what you’re describing and was actually about to open an issue after reading your SO post. @ian-r-rose and I had a good discussion on this, and it does seem that at the very least the documentation should be updated to reflect this behavior, but it’s possible this could be a bug. @gjoseph92 maybe you have thoughts on this?
Thanks for the ping, I am all for deprecating surprising behaviors and eager computations. @DahnJ did you open an issue for adding a disk option to persist? That does seem like an interesting idea. I wonder if it could be supported that the workers could just pick up the pieces of data that they know about even if they don’t share a disk.