[QST]: p2p shuffle on large datasets
I'm attempting to use the p2p shuffle implementation (from the branch proposed for merge in #7326) to shuffle a ~1TB dataset. The data exists on disk as ~300 parquet files (each expanding to around 2GiB in memory, with 23 columns), and I'm trying to shuffle into around 300 output partitions and write the result to parquet. The key column is a string (although I can convert it to int or datetime if that would help); the other columns are a mix of string, int, and float.
This is on a machine with 1TB RAM, and 40 cores. I run like so:
from pathlib import Path

import dask.dataframe as dd
from distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=40)
    client = Client(cluster)

    inputdir = Path(".../input")
    outputdir = Path(".../output-shuffled/")

    ddf = dd.read_parquet(inputdir, split_row_groups=False)
    ddf = ddf.shuffle("key", shuffle="p2p")
    ddf.to_parquet(outputdir / "store_sales")
This progresses quite well for a while, with peak memory usage hitting ~600GB. At some point, though, some workers reach 95% of their memory limits and are then killed by the nanny.
Am I configuring things wrong? Do I need to switch on anything else? Or should I not be expecting this to work right now?
Issue Analytics
- Created: 9 months ago
- Comments: 21 (21 by maintainers)
Top GitHub Comments
Unfortunately not, AFAIK.
To follow up here, I was able to get the following script to run to completion:
This was on a machine with 40 physical cores and 1TB of RAM.
I needed to set:
(Probably they didn’t need to be that high, but belt-and-braces)
I also needed to overcommit the memory limit for each worker to 100GiB.
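For reference, the per-worker overcommit can be set directly on the cluster; a minimal sketch, reusing the original script's setup (the 100GiB figure is from the comment above; nothing else here is specific to p2p):

```python
from distributed import Client, LocalCluster

if __name__ == "__main__":
    # 40 workers x 100GiB nominally overcommits the machine's 1TB of RAM;
    # the point is to keep the nanny's kill threshold above the ~80GiB
    # peak that the single skew-heavy worker reaches.
    cluster = LocalCluster(n_workers=40, memory_limit="100GiB")
    client = Client(cluster)
```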
The reason for this, and for the previous failures, is that the dataset has a very skewed distribution for the shuffle key. In particular, there is a single key value that corresponds to around 5% of the total rows (this leads to one worker peaking at 80GiB memory usage when performing the len calculation, while all the others sit comfortably around 4GiB). The dataset has 2879987999 total rows, and the largest output partition has 132603535 rows.
In this particular instance, I know that downstream I don't need to merge the dataset on this key (the shuffle is just a pre-sorting step), so, knowing the skewed key distribution in advance, I could write code to manually construct a better partitioning key. I wonder to what extent that might be automated. One could imagine extending the interface to allow the user to provide a prior on the key distribution, so that the shuffling mechanism can make sensible decisions.
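One way to manually construct such a key is to "salt" the known-hot value, splitting it into several sub-keys so its rows spread across multiple output partitions. A sketch under stated assumptions: salt_key, salted_key, and n_salts are hypothetical names for illustration, not part of dask's shuffle API, and this only works when nothing downstream merges on the original key:

```python
import numpy as np
import pandas as pd

def salt_key(df: pd.DataFrame, hot_value: str, n_salts: int = 8) -> pd.DataFrame:
    # Hypothetical helper: rows whose key equals hot_value get a random
    # suffix "_0".."_{n_salts-1}"; all other rows keep their key unchanged.
    rng = np.random.default_rng(0)
    salts = rng.integers(0, n_salts, size=len(df)).astype(str)
    is_hot = (df["key"] == hot_value).to_numpy()
    return df.assign(
        salted_key=np.where(is_hot, df["key"].to_numpy() + "_" + salts, df["key"])
    )

pdf = pd.DataFrame({"key": ["hot"] * 6 + ["a", "b"]})
out = salt_key(pdf, "hot", n_salts=4)
```

One would then shuffle on "salted_key" (e.g. via map_partitions plus ddf.shuffle("salted_key", shuffle="p2p")) instead of "key", capping the largest output partition at roughly 1/n_salts of the hot key's rows.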
In any case, having figured out the issues, I can, if it is of interest, construct a synthetic dataset that would allow you to test things too (I think one can also replicate the problem at a smaller scale by doing the same thing with tighter worker memory limits).
Complete log (not fully error/warning-free)