Data loss with `DataFrame.set_index(.., shuffle="disk")`
What happened:
`DataFrame.set_index(..., shuffle="disk")` is losing a significant amount of data when multiple workers are used, i.e. the length of the result dataframe is much smaller than the length of the initial dataframe.
What you expected to happen:
The length of the dataframe before and after `set_index` to be the same.
Minimal Complete Verifiable Example:
import os
import uuid

import pandas as pd
import dask.dataframe as dd
import dask_kubernetes
from dask.distributed import Client

cluster = dask_kubernetes.KubeCluster.from_yaml('worker.yaml', name=f'{os.getenv("HOSTNAME")}-dask', n_workers=6)
client = Client(cluster)

test_ddf = dd.from_pandas(pd.DataFrame({
    'uuid': [str(uuid.uuid4()) for i in range(10000)],
}), chunksize=100)

len(test_ddf)
# 10000
len(test_ddf.set_index('uuid'))
# 10000
len(test_ddf.set_index('uuid', shuffle='disk'))
# 1669
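As a diagnostic sketch (not part of the original report; it reuses test_ddf from the example above), the dropped rows can be inspected by comparing the uuids before and after the shuffle:

# Compare the uuids present before and after the disk shuffle.
result = test_ddf.set_index('uuid', shuffle='disk')

before = set(test_ddf['uuid'].compute())
after = set(result.index.compute())

missing = before - after
print(len(missing))  # number of rows silently dropped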
Anything else we need to know?:
Environment:
- Dask version: 2.26.0
- Python version: 3.8.5.final.0
- Operating System: Linux
- Install method (conda, pip, source): conda
{'host': {'python': '3.8.5.final.0', 'python-bits': 64, 'OS': 'Linux', 'OS-release': '5.4.0-51-generic', 'machine': 'x86_64', 'processor': 'x86_64', 'byteorder': 'little', 'LC_ALL': 'None', 'LANG': 'None'}, 'packages': {'python': '3.8.5.final.0', 'dask': '2.26.0', 'distributed': '2.26.0', 'msgpack': '1.0.0', 'cloudpickle': '1.6.0', 'tornado': '6.0.4', 'toolz': '0.10.0', 'numpy': '1.19.1', 'lz4': '3.1.0', 'blosc': None}}
Top GitHub Comments
Thanks @elephantum, either I or another Dask maintainer will investigate. Just FYI, folks who would normally look into this are beginning to take off for the end-of-year holidays, so it may be a couple of weeks before someone is able to take a look.
I agree that silently losing data is not great and that should be fixed.
`shuffle=disk` can be multi-worker but single-machine (xref: https://docs.dask.org/en/latest/dataframe-groupby.html#shuffle-methods). In your case, running on k8s, disk shuffling is not a good fit. Instead I would ask you to dig into the problems you had with the regular task-based shuffle. Dask does have advanced spilling techniques which can be tuned depending on worker memory resources. So, a) and b) above are covered by dask/distributed but may require some tuning of the memory allocated per worker and of when workers should spill.
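As a rough sketch of the suggested workaround (not from the original thread; it reuses test_ddf from the example above, and the worker-memory keys are distributed's standard configuration settings), switching to the task-based shuffle and tuning when workers spill could look like this:

import dask

# Worker-memory thresholds (fractions of the worker memory limit) that control
# when data is spilled to disk; these are distributed's standard worker.memory
# settings and would normally be set in the worker configuration before the
# workers start (e.g. in the image referenced by worker.yaml).
dask.config.set({
    "distributed.worker.memory.target": 0.60,    # start spilling least-recently-used data
    "distributed.worker.memory.spill": 0.70,     # spill more aggressively
    "distributed.worker.memory.pause": 0.80,     # pause accepting new tasks
    "distributed.worker.memory.terminate": 0.95, # restart the worker
})

# Task-based shuffle: data moves worker-to-worker over the network, so it works
# across machines, unlike shuffle='disk' which assumes all workers share a
# single local filesystem.
result = test_ddf.set_index('uuid', shuffle='tasks')
len(result)

Note that worker-side memory settings generally need to be in place before the workers start rather than set from the client afterwards.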