PERF: unnecessary (expensive) concat
I have a Ray worker that is calling `PandasDataframeAxisPartition.deploy_axis_func` and, inside that, running `pandas.concat` on 16 DataFrames with MultiIndex indexes, which is an expensive concat.
AFAIK there isn’t a nice way to see what called `deploy_axis_func`, so this is a bit speculative.
I think the partitions being collected are exactly the partitions of an existing DataFrame, which would mean that frame’s index is already materialized somewhere, so reconstructing it inside `concat` is unnecessary. I.e., in `deploy_axis_func` we could do something like:
```diff
+orig_indexes = [x.index for x in partitions]
+N = 0
+for obj in partitions:
+    obj.index = range(N, N + len(obj))
+    N += len(obj)
 dataframe = pandas.concat(list(partitions), axis=axis, copy=False)
+dataframe.index = thing_we_already_know_so_dont_need_to_recompute
+
+for index, obj in zip(orig_indexes, partitions):
+    obj.index = index
```
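To make the idea concrete, here is a self-contained sketch of the same trick for the `axis=0` case; `concat_with_known_index` and `known_index` are placeholder names I’m inventing here, not actual Modin API:

```python
import pandas as pd

def concat_with_known_index(partitions, known_index):
    """Row-concat partitions, pinning an index we already have instead of
    letting pandas.concat rebuild it from the partitions' MultiIndexes."""
    # Remember the real indexes so we can restore them afterwards.
    orig_indexes = [obj.index for obj in partitions]
    n = 0
    for obj in partitions:
        # A RangeIndex is trivial for concat to combine.
        obj.index = pd.RangeIndex(n, n + len(obj))
        n += len(obj)
    try:
        result = pd.concat(partitions, axis=0, copy=False)
        # Pin the index we already know; it must match the total row count.
        result.index = known_index
    finally:
        # Undo the in-place mutation of the inputs.
        for index, obj in zip(orig_indexes, partitions):
            obj.index = index
    return result
```

Note the sketch mutates the inputs’ indexes in place, which is why the restore step in `finally` matters if the partitions are shared with anything else.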
If I’m right here, we could see significant savings. E.g., in the script I’m profiling, at the moment 5% of the time is spent in `_get_concat_axis`, and I think a lot more is spent indirectly inside worker processes.
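For anyone wanting to reproduce this outside the full script, a hypothetical micro-benchmark along these lines should surface the index-rebuild cost (the shapes here are made up, not taken from my workload):

```python
import cProfile
import numpy as np
import pandas as pd

# 16 partitions with distinct MultiIndexes, loosely mimicking the case above.
parts = [
    pd.DataFrame(
        np.random.rand(10_000, 4),
        index=pd.MultiIndex.from_product([[i], range(10_000)]),
    )
    for i in range(16)
]

# _get_concat_axis (the result-index reconstruction) should show up here.
cProfile.run("pd.concat(parts, axis=0, copy=False)", sort="cumtime")
```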
Moreover, if this works, we could do the pinning/unpinning before pickling/unpickling and save on pickle costs.
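As a rough sanity check of the pickle claim (again hypothetical numbers, same toy shape as above), comparing the pickled size with and without the MultiIndex attached:

```python
import pickle
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([range(1_000), range(16)])
df = pd.DataFrame(np.random.rand(len(idx), 4), index=idx)

multiindex_bytes = len(pickle.dumps(df))
pinned = df.copy()
pinned.index = pd.RangeIndex(len(pinned))  # pin a cheap index before pickling
rangeindex_bytes = len(pickle.dumps(pinned))

# The gap between the two is roughly what shipping the MultiIndex costs
# on every transfer between processes.
print(multiindex_bytes, rangeindex_bytes)
```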
---

I’m not sure that’s allowed. If it helps, @yarshev and @anmyachev are looking at the same script.
---

There’s a patch that speeds up this particular case but may slow down other cases (so I haven’t decided yet whether to upstream it to pandas):