
[QST]: p2p shuffle on large datasets


I’m attempting to use the p2p shuffle implementation (using the branch proposed for merge in #7326) to shuffle a ~1TB dataset. The data exists on disk as ~300 parquet files (each expanding to around 2GiB in size, with 23 columns), and I’m trying to shuffle into around 300 output partitions and write to parquet. The key column is a string (although I can convert to int or datetime if that would help); the other columns are a mix of string, int, and float.

This is on a machine with 1TB RAM, and 40 cores. I run like so:

from pathlib import Path

import dask.dataframe as dd
from distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=40)
    client = Client(cluster)
    inputdir = Path(".../input")
    outputdir = Path(".../output-shuffled/")
    # One partition per input file (~300 files, ~2GiB each)
    ddf = dd.read_parquet(inputdir, split_row_groups=False)

    # Full shuffle on the key column using the p2p implementation
    ddf = ddf.shuffle('key', shuffle="p2p")

    ddf.to_parquet(outputdir / "store_sales")

This progresses quite well for a while, with peak memory usage hitting ~600GB. At some point, though, some workers reach 95% of their memory limits and are then killed by the nanny.

Am I configuring things wrong? Do I need to switch on anything else? Or should I not be expecting this to work right now?
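As context (and not part of the original report): the 95% threshold described above corresponds to distributed’s worker-memory termination setting, which is configurable. A minimal sketch, assuming the standard distributed configuration keys and with purely illustrative values:

import dask

# Hypothetical tuning of the worker-memory thresholds; by default the nanny
# restarts a worker at 95% of its memory limit ("terminate").
dask.config.set({
    "distributed.worker.memory.target": 0.75,     # start spilling to disk
    "distributed.worker.memory.spill": 0.85,      # spill more aggressively
    "distributed.worker.memory.pause": 0.90,      # pause new task execution
    "distributed.worker.memory.terminate": 0.98,  # nanny restart threshold
})

This would need to run before the cluster is created, and raising the thresholds only delays the restart; it does not change how much memory the shuffle needs.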


Top GitHub Comments

wence- commented, Dec 9, 2022 (1 reaction)

I assume this is not a public dataset, is it?

Unfortunately not, AFAIK.

wence- commented, Dec 16, 2022 (0 reactions)

To follow up here, I was able to get the script included below (after these notes) to run to completion.

This was on a machine with 40 physical cores and 1TB of RAM.

I needed to set:

export DASK_DISTRIBUTED__SCHEDULER__WORKER__TTL=3600s
export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=3600s
export DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP=3600s

(Probably they didn’t need to be that high, but belt-and-braces)
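For reference (not from the original comment), the same settings can presumably also be expressed through dask’s config system rather than environment variables; a sketch, assuming the standard distributed config keys:

import dask

# Equivalent of the three environment variables above, set programmatically
# before the cluster is started (key names assume the standard schema).
dask.config.set({
    "distributed.scheduler.worker-ttl": "3600s",
    "distributed.comm.timeouts.connect": "3600s",
    "distributed.comm.timeouts.tcp": "3600s",
})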

I also needed to overcommit the memory limit for each worker to 100GiB.

The reason for this, and for the previous failures, is that this dataset has a very skewed distribution of the shuffle key. In particular, there is a single key value that corresponds to around 5% of the total rows; this leads to one worker peaking at 80GiB of memory usage when performing the len calculation, while all others sit comfortably at around 4GiB.

The dataset has 2879987999 total rows, and the largest output partition has 132603535 rows (about 4.6% of the total, consistent with the skewed key described above).

In this particular instance, I know that downstream I don’t need to merge the dataset on this key (it’s just a pre-sorting step), so with prior knowledge of the skewed key distribution I could write code to manually construct a better partitioning key. I wonder to what extent that might be automated. One could imagine extending the interface to allow the user to provide a prior on the key distribution so that the shuffling mechanism can make sensible decisions.
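A hedged sketch (not from the original comment) of what “manually construct a better partitioning key” could look like: salt the known-hot key values so that a single heavy key spreads over several output partitions. The column name, the hot-key set, and the salt count are all hypothetical.

import numpy as np

def add_salted_key(df, hot_keys, n_salts=8):
    # Derived shuffle key: rows whose key is in the known-hot set get a
    # random numeric suffix so they spread over n_salts partitions; all
    # other rows keep their original key.
    salt = np.random.randint(0, n_salts, size=len(df)).astype(str)
    key = df["key"].astype(str)
    is_hot = key.isin(hot_keys)
    return df.assign(salted_key=key.where(~is_hot, key + "_" + salt))

# hot_keys would come from prior knowledge of the skew (e.g. a value_counts()
# on a sample); the key value here is illustrative only.
# ddf = ddf.map_partitions(add_salted_key, hot_keys={"the_heavy_key"})
# ddf = ddf.shuffle("salted_key", shuffle="p2p")

This only works because no downstream merge on the original key is needed; after salting, rows with the same original key value deliberately land in different output partitions.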

In any case, having figured out the issues, I can, if it is interesting, construct a synthetic dataset that would allow you to test things too (a rough sketch of such a generator appears after the log below). I think one can also replicate the problem at a smaller scale by doing the same thing with tighter worker memory limits.

from pathlib import Path

import dask
import dask.dataframe as dd
import distributed
from distributed import Client, LocalCluster

if __name__ == "__main__":
    print("Dask version:", dask.__version__)
    print("Distributed version:", distributed.__version__)
    # 20 workers, each overcommitted to a 100GiB memory limit
    cluster = LocalCluster(n_workers=20, memory_limit="100GiB")
    client = Client(cluster)
    inputdir = Path(".../input/")
    outputdir = Path(".../shuffled/")
    ddf = dd.read_parquet(inputdir, split_row_groups=True)
    ddf = ddf.shuffle('shuffle_key', shuffle="p2p")
    # Force the shuffle and report the resulting partition size distribution
    final_partition_sizes = ddf.map_partitions(len).compute()
    print(f"Num out partitions = {len(final_partition_sizes)}")
    print(final_partition_sizes.max(), final_partition_sizes.min())
    print(final_partition_sizes)
Complete log (not fully error/warning-free):
Dask version: 2022.12.0
Distributed version: 2022.12.0+95.gbc317e20
.../lib/python3.9/contextlib.py:126: UserWarning: Creating scratch directories is taking a surprisingly long time. (6.11s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
'NoneType' object has no attribute 'add_next_tick_callback'
Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File ".../lib/python3.9/site-packages/distributed/dashboard/components/shared.py", line 315, in cb
    self.doc().add_next_tick_callback(lambda: self.update(prof, metadata))
AttributeError: 'NoneType' object has no attribute 'add_next_tick_callback'
2022-12-16 03:54:12,792 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f7afefda070>>, <Task finished name='Task-4397943' coro=<ProfileTimePlot.trigger_update.<locals>.cb() done, defined at .../lib/python3.9/site-packages/distributed/utils.py:740> exception=AttributeError("'NoneType' object has no attribute 'add_next_tick_callback'")>)
Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File ".../lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File ".../lib/python3.9/site-packages/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File ".../lib/python3.9/site-packages/distributed/dashboard/components/shared.py", line 315, in cb
    self.doc().add_next_tick_callback(lambda: self.update(prof, metadata))
AttributeError: 'NoneType' object has no attribute 'add_next_tick_callback'
'NoneType' object has no attribute 'add_next_tick_callback'
Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File ".../lib/python3.9/site-packages/distributed/dashboard/components/shared.py", line 315, in cb
    self.doc().add_next_tick_callback(lambda: self.update(prof, metadata))
AttributeError: 'NoneType' object has no attribute 'add_next_tick_callback'
2022-12-16 03:54:21,762 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f7afefda070>>, <Task finished name='Task-4412619' coro=<ProfileTimePlot.trigger_update.<locals>.cb() done, defined at .../lib/python3.9/site-packages/distributed/utils.py:740> exception=AttributeError("'NoneType' object has no attribute 'add_next_tick_callback'")>)
Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File ".../lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File ".../lib/python3.9/site-packages/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File ".../lib/python3.9/site-packages/distributed/dashboard/components/shared.py", line 315, in cb
    self.doc().add_next_tick_callback(lambda: self.update(prof, metadata))
AttributeError: 'NoneType' object has no attribute 'add_next_tick_callback'
Num out partitions = 4367
132603535 0
0        869862
1       1986682
2        868497
3             0
4             0
         ...   
4362          0
4363          0
4364          0
4365     867148
4366          0
Length: 4367, dtype: int64
2022-12-16 03:55:45,969 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/distributed/worker.py", line 1213, in heartbeat
    response = await retry_operation(
  File ".../lib/python3.9/site-packages/distributed/utils_comm.py", line 400, in retry_operation
    return await retry(
  File ".../lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry
    return await coro()
  File ".../lib/python3.9/site-packages/distributed/core.py", line 1210, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File ".../lib/python3.9/site-packages/distributed/core.py", line 975, in send_recv
    response = await comm.read(deserializers=deserializers)
  File ".../lib/python3.9/site-packages/distributed/comm/tcp.py", line 241, in read
    convert_stream_closed_error(self, e)
  File ".../lib/python3.9/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:47992 remote=tcp://127.0.0.1:35821>: Stream is closed
2022-12-16 03:55:45,972 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/distributed/worker.py", line 1213, in heartbeat
    response = await retry_operation(
  File ".../lib/python3.9/site-packages/distributed/utils_comm.py", line 400, in retry_operation
    return await retry(
  File ".../lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry
    return await coro()
  File ".../lib/python3.9/site-packages/distributed/core.py", line 1210, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File ".../lib/python3.9/site-packages/distributed/core.py", line 975, in send_recv
    response = await comm.read(deserializers=deserializers)
  File ".../lib/python3.9/site-packages/distributed/comm/tcp.py", line 241, in read
    convert_stream_closed_error(self, e)
  File ".../lib/python3.9/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:47998 remote=tcp://127.0.0.1:35821>: Stream is closed
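Not from the thread: a scaled-down synthetic reproducer along the lines described above might look like the following sketch. The file counts, row counts, 5% skew fraction, and tight worker memory limits are all hypothetical, and pyarrow is assumed for the parquet writes.

from pathlib import Path

import numpy as np
import pandas as pd
import dask.dataframe as dd
from distributed import Client, LocalCluster

N_FILES = 30              # number of synthetic input files (hypothetical)
ROWS_PER_FILE = 1_000_000
HOT_FRACTION = 0.05       # one key value receives ~5% of all rows

def make_partition(i):
    # Mostly uniform keys plus a single heavy key ("hot") in every file.
    rng = np.random.default_rng(i)
    n_hot = int(ROWS_PER_FILE * HOT_FRACTION)
    keys = np.concatenate([
        np.full(n_hot, "hot"),
        rng.integers(0, 10_000, size=ROWS_PER_FILE - n_hot).astype(str),
    ])
    rng.shuffle(keys)
    return pd.DataFrame({
        "shuffle_key": keys,
        "payload": rng.random(ROWS_PER_FILE),
    })

if __name__ == "__main__":
    outdir = Path("synthetic-input")
    outdir.mkdir(exist_ok=True)
    for i in range(N_FILES):
        make_partition(i).to_parquet(outdir / f"part-{i:03d}.parquet")

    # Deliberately tight memory limits to provoke the skew problem at small scale.
    cluster = LocalCluster(n_workers=4, memory_limit="2GiB")
    client = Client(cluster)
    ddf = dd.read_parquet(outdir)
    ddf = ddf.shuffle("shuffle_key", shuffle="p2p")
    sizes = ddf.map_partitions(len).compute()
    print(sizes.max(), sizes.min())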