25% performance regression in merges
Our weekly multi-node benchmarking (working on making this publicly visible) shows a performance regression in simple dataframe merges, which I can pinpoint to #6975. (This was briefly reverted in #6994 and then reintroduced in #7007).
More specifically, #6975 changes the decision making in _select_keys_for_gather relative to the logic that was there prior to this change.
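Roughly, the difference can be paraphrased with the toy sketch below (my paraphrase, not the actual diff in #6975; "budget" stands in for bytes_left_to_fetch / transfer_message_target_bytes and "incoming_bytes" for self.transfer_incoming_bytes, i.e. the identifiers quoted later in this issue):

# Toy paraphrase of the batching decision in _select_keys_for_gather;
# not the real implementation, just the shape of the change.
def select_batch(task_sizes, budget, incoming_bytes, new_logic):
    """Keep adding tasks (in priority order) until the byte budget is hit."""
    to_gather, total_nbytes = [], 0
    for nbytes in task_sizes:
        if new_logic:
            # Post-#6975 flavour: if other transfers are already in flight,
            # even the top-priority task can be deferred once it would
            # exceed the budget.
            stop = (to_gather or incoming_bytes) and total_nbytes + nbytes > budget
        else:
            # Pre-#6975 flavour (and what I get after removing the
            # incoming-bytes part): the top-priority task is always taken.
            stop = bool(to_gather) and total_nbytes + nbytes > budget
        if stop:
            break
        to_gather.append(nbytes)
        total_nbytes += nbytes
    return to_gather

# A 60 MB top-priority task against a 50 MB budget, with 10 MB already in flight:
print(select_batch([60e6, 5e6], 50e6, 10e6, new_logic=False))  # [60000000.0]
print(select_batch([60e6, 5e6], 50e6, 10e6, new_logic=True))   # []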
Note the difference in whether we fetch the top priority task. If I remove the part of the decision making logic that looks at self.incoming_transfer_bytes, so that the check becomes:
if (
    to_gather
    and total_nbytes + ts.get_nbytes() > bytes_left_to_fetch
):
Then performance goes back to where it was previously.
Not sure of the correct way to square this circle. I don't understand how the change in _select_keys_for_gather interacts with the intention of the PR to throttle data transfer.
cc @hendrikmakait (as author of #6975)
Top GitHub Comments
Setting

export DASK_DISTRIBUTED__WORKER__MEMORY__TRANSFER=1

(which I think is the maximum value) doesn't improve things. In fact, it appears that setting this value doesn't really have an effect at all for this workload (I get effectively the same throughput with export DASK_DISTRIBUTED__WORKER__MEMORY__TRANSFER=0.00000001).
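For reference, the equivalent setting via dask.config (assuming the usual DASK_* environment-variable-to-key mapping, so this is the same knob rather than a different one) would be something like:

import dask

# DASK_DISTRIBUTED__WORKER__MEMORY__TRANSFER maps to the nested key
# "distributed.worker.memory.transfer" under Dask's env-var convention;
# as I understand it, this is the fraction of the worker memory limit
# budgeted for concurrent incoming transfers.
dask.config.set({"distributed.worker.memory.transfer": 1.0})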
Inspecting the values of self.transfer_incoming_bytes_limit, self.transfer_incoming_bytes, and self.transfer_message_target_bytes, it appears that the limit on bytes_left_to_fetch always comes from self.transfer_message_target_bytes (which is hard-coded at 50MB).

These benchmarks are running on a high-performance network (between 12 and 45 GiB/s uni-directional bandwidth, depending on the worker pairing), so the default of capping a batch of "small" messages from a single worker at 50MB total is getting in the way (I can send multiple GiBs of data in less than a second).
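To put rough numbers on that (back-of-the-envelope only, taking 50 MB as 50e6 bytes and ignoring latency and protocol overhead):

# Time to move a single 50 MB message at the measured bandwidths.
target_bytes = 50e6                        # transfer_message_target_bytes
for gib_per_s in (12, 45):                 # measured uni-directional bandwidth
    bandwidth = gib_per_s * 2**30          # bytes per second
    t_ms = target_bytes / bandwidth * 1e3  # milliseconds per message
    print(f"{gib_per_s} GiB/s -> ~{t_ms:.1f} ms per 50 MB message")
# ~3.9 ms at 12 GiB/s and ~1.0 ms at 45 GiB/s, so each message keeps the
# link busy for only a few milliseconds, which is why the 50 MB cap is so
# restrictive on this hardware.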
I think what is happening is that previously there might have been two messages in flight between any given pair of workers at any one time, whereas now the changed logic means we limit to a single message.
So I think that #6975 fixed the logic in terms of limiting with respect to transfer_message_target_bytes, but this turns out to be bad in some settings. One way to fix this would be to add configuration for transfer_message_target_bytes, I suppose.

Fixed by https://github.com/dask/distributed/pull/7071
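Once such a configuration knob exists, overriding it on a network like ours would presumably look something like the sketch below (the key name here is my guess, not necessarily what #7071 actually adds):

import dask

# Hypothetical override of the per-message batching target; the real
# configuration key introduced for transfer_message_target_bytes may differ.
dask.config.set({"distributed.worker.transfer.message-bytes-limit": "500 MiB"})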