Data loss with `DataFrame.set_index(.., shuffle="disk")`

What happened: `DataFrame.set_index(..., shuffle="disk")` loses a significant amount of data when multiple workers are used.

That is, the resulting dataframe is much shorter than the initial dataframe.

What you expected to happen:

The dataframe should have the same length before and after set_index.

Minimal Complete Verifiable Example:

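import os
import uuid

import dask.dataframe as dd
import dask_kubernetes
import pandas as pd
from dask.distributed import Client

# Start a 6-worker cluster on Kubernetes and connect to it
cluster = dask_kubernetes.KubeCluster.from_yaml('worker.yaml', name=f'{os.getenv("HOSTNAME")}-dask', n_workers=6)
client = Client(cluster)

# 10000 rows with random string keys, split into partitions of 100 rows
test_ddf = dd.from_pandas(pd.DataFrame({
    'uuid': [str(uuid.uuid4()) for _ in range(10000)],
}), chunksize=100)

len(test_ddf)
# 10000

len(test_ddf.set_index('uuid'))
# 10000

len(test_ddf.set_index('uuid', shuffle='disk'))
# 1669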
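
Until the data loss itself is fixed, a pipeline can at least fail loudly instead of continuing with missing rows. The sketch below is a hypothetical helper (`assert_same_length` is not part of Dask) that compares lengths before and after an operation. It works on anything that supports `len()`; on a Dask dataframe, `len()` triggers a computation, so the check is not free.

```python
def assert_same_length(before, after, operation="set_index"):
    """Raise ValueError if `after` does not have as many rows as `before`.

    Hypothetical helper, not a Dask API. Both arguments only need len().
    """
    n_before, n_after = len(before), len(after)
    if n_before != n_after:
        raise ValueError(
            f"{operation} changed the row count from {n_before} to {n_after}"
        )
    return n_after
```

With the example above, `assert_same_length(test_ddf, test_ddf.set_index('uuid', shuffle='disk'))` would raise rather than silently return a shorter dataframe.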

Anything else we need to know?:

Environment:

  • Dask version: 2.26.0
  • Python version: 3.8.5.final.0
  • Operating System: Linux
  • Install method (conda, pip, source): conda

{'host': {'python': '3.8.5.final.0', 'python-bits': 64, 'OS': 'Linux', 'OS-release': '5.4.0-51-generic', 'machine': 'x86_64', 'processor': 'x86_64', 'byteorder': 'little', 'LC_ALL': 'None', 'LANG': 'None'}, 'packages': {'python': '3.8.5.final.0', 'dask': '2.26.0', 'distributed': '2.26.0', 'msgpack': '1.0.0', 'cloudpickle': '1.6.0', 'tornado': '6.0.4', 'toolz': '0.10.0', 'numpy': '1.19.1', 'lz4': '3.1.0', 'blosc': None}}

Issue Analytics

  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 14 (5 by maintainers)

Top GitHub Comments

1 reaction
jrbourbeau commented, Dec 22, 2020

Thanks @elephantum, either I or another Dask maintainer will investigate. Just FYI, the folks who would normally look into this are beginning to take off for end-of-year holidays, so it may be a couple of weeks before someone is able to take a look.

0 reactions
quasiben commented, Jan 26, 2021

> Also: if shuffle=disk is not intended for multi-worker cluster environment it would be great at least to fail with some explicit message. Currently dask silently loses data, I don’t believe it can be an acceptable behaviour.

I agree that silently losing data is not great and that it should be fixed. shuffle=disk supports multiple workers, but only on a single machine (xref: https://docs.dask.org/en/latest/dataframe-groupby.html#shuffle-methods).

In your case, running on k8s is not ideal for disk shuffling. Instead, I would ask you to dig into the problems you had with regular task-based shuffling. Dask does have advanced spilling techniques, which can be tuned depending on worker memory resources. So a) and b) above are covered by dask/distributed, but may require some tuning of the memory allocated per worker and of when workers should spill.
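
As a rough illustration of the tuning mentioned above, the sketch below collects the `distributed.worker.memory.*` thresholds in one place. The numeric values are illustrative assumptions, not recommendations from the thread, and `to_env_vars` and `apply_locally` are hypothetical helper names. Each threshold is a fraction of the per-worker memory limit. Because Kubernetes workers run in separate pods, the settings would have to reach them somehow, for example as `DASK_*` environment variables in the worker pod spec, which is what `to_env_vars` produces.

```python
# Fractions of the per-worker memory limit (illustrative values only).
SPILL_SETTINGS = {
    "distributed.worker.memory.target": 0.50,     # begin spilling managed data to disk
    "distributed.worker.memory.spill": 0.60,      # spill based on process memory
    "distributed.worker.memory.pause": 0.80,      # stop starting new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
}


def to_env_vars(settings):
    """Map dotted Dask config keys to DASK_* environment variable names.

    Dask reads variables prefixed with DASK_, using a double underscore
    for each level of nesting.
    """
    return {
        "DASK_" + key.upper().replace(".", "__"): str(value)
        for key, value in settings.items()
    }


def apply_locally(settings):
    """Apply the settings in the current process (needs dask installed)."""
    import dask  # imported lazily so the helpers above work without dask

    dask.config.set(settings)
```

Calling `to_env_vars(SPILL_SETTINGS)` gives name/value pairs that could be copied into the `env` section of a worker pod definition; `apply_locally` only affects the process it runs in, so it is mainly useful for a local cluster created afterwards.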
