PERF: unnecessary (expensive) concat

I have a Ray worker that is calling PandasDataframeAxisPartition.deploy_axis_func, and inside it pandas.concat is run on 16 DataFrames with MultiIndex indexes, which is an expensive concat.

AFAIK there isn’t a nice way to see what called deploy_axis_func, so this is a bit speculative.

I think the partitions being collected are exactly the partitions of an existing DataFrame, which I think means that frame’s index is already materialized somewhere, so reconstructing it inside concat is unnecessary. That is, in deploy_axis_func we could do something like:

+orig_indexes = [x.index for x in partitions]
+N = 0
+for obj in partitions:
+    obj.index = range(N, N+len(obj))
+    N += len(obj)

dataframe = pandas.concat(list(partitions), axis=axis, copy=False)

+dataframe.index = thing_we_already_know_so_dont_need_to_recompute
+
+for index, obj in zip(orig_indexes, partitions):
+    obj.index = index
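
The diff above can be tried outside of Modin with a small standalone sketch. It assumes the caller already holds the full index of the original frame; the helper name concat_with_known_index and its known_index argument are hypothetical and stand in for the placeholder in the diff. It is not Modin's actual implementation.

```python
import pandas


def concat_with_known_index(partitions, known_index):
    """Concatenate row partitions, then attach an index the caller already has.

    Hypothetical helper: `known_index` plays the role of the already
    materialized index mentioned in the issue.
    """
    orig_indexes = [part.index for part in partitions]
    try:
        start = 0
        for part in partitions:
            # Cheap placeholder index, so concat has no MultiIndex to rebuild.
            part.index = pandas.RangeIndex(start, start + len(part))
            start += len(part)
        result = pandas.concat(partitions, axis=0)
    finally:
        # Put the original indexes back on the input frames.
        for index, part in zip(orig_indexes, partitions):
            part.index = index
    # Attach the index we already knew instead of recomputing it.
    result.index = known_index
    return result
```

Whether this saves time in practice depends on how cheaply the known index can be obtained on the worker, which the issue leaves open.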

If I’m right here, we could see significant savings. For example, in the script I’m profiling, about 5% of the time is currently spent in _get_concat_axis, and I think a lot more is spent indirectly inside worker processes.
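
A measurement like the one above can be approximated locally with the standard-library profiler. This is a minimal sketch under the assumption that the concat runs in the current process; it does not see time spent inside Ray workers, and profile_concat is a hypothetical helper name.

```python
import cProfile
import io
import pstats

import pandas


def profile_concat(frames):
    """Profile one pandas.concat call and return the concat-related stats as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    pandas.concat(frames, axis=0)
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
    # Restrict the report to entries whose file or function name matches "concat".
    stats.print_stats("concat")
    return stream.getvalue()
```

Printing the returned string shows cumulative time per function, which is where an internal helper such as _get_concat_axis would show up if the installed pandas version has one by that name.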

Moreover, if this works, we could do the pinning/unpinning before pickling/unpickling and save on pickle costs.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

2 reactions
jbrockmendel commented, Aug 1, 2022

> Could you share the script you are using to profile?

I’m not sure that’s allowed. If it helps, @yarshev and @anmyachev are looking at the same script.

1 reaction
jbrockmendel commented, Aug 2, 2022

> Is there an easy way to speed up concatenating the MultiIndex itself on the pandas side?

There’s a patch that speeds up this particular case, but it may slow down other cases (so I haven’t decided yet whether to upstream it to pandas):

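import numpy as np
import pandas
from pandas.core.dtypes.missing import array_equivalent  # pandas-internal helper

orig_mi_append = pandas.MultiIndex.append

def new_append(self, other):
    if not isinstance(other, list):
        other = [other]

    # Fast path: every operand is a MultiIndex with the same number of levels
    # and equivalent level values, so only the codes need to be concatenated.
    if all(isinstance(obj, pandas.MultiIndex) for obj in other):
        if all(obj.nlevels == self.nlevels for obj in other):
            if all(
                all(array_equivalent(slev, olev) for slev, olev in zip(self._levels, obj._levels))
                for obj in other
            ):
                objs = [self] + other
                new_codes = []
                for i in range(self.nlevels):
                    lev_codes = np.concatenate([obj.codes[i] for obj in objs])
                    new_codes.append(lev_codes)
                mi = pandas.MultiIndex(codes=new_codes, levels=self.levels, names=self.names)
                return mi
    # Anything else falls back to the stock implementation.
    return orig_mi_append(self, other)

# Monkey-patch pandas so every MultiIndex.append call goes through the fast path check.
pandas.MultiIndex.append = new_append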
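
The same fast path can be sketched as a standalone function that uses only public MultiIndex attributes and avoids monkey-patching. The name append_shared_levels is hypothetical, and the sketch assumes that reusing the first index's names is acceptable, which is the same simplification the patch above makes.

```python
import numpy as np
import pandas


def append_shared_levels(first, others):
    """Append MultiIndexes by concatenating codes when all levels are identical."""
    others = list(others)
    same_levels = all(
        isinstance(obj, pandas.MultiIndex)
        and obj.nlevels == first.nlevels
        and all(a.equals(b) for a, b in zip(first.levels, obj.levels))
        for obj in others
    )
    if not same_levels:
        # Levels differ, so defer to the regular (slower) implementation.
        return first.append(others)

    objs = [first, *others]
    new_codes = [
        np.concatenate([obj.codes[i] for obj in objs]) for i in range(first.nlevels)
    ]
    # The codes come from valid indexes over the same levels, so skip re-validation.
    return pandas.MultiIndex(
        levels=first.levels, codes=new_codes, names=first.names, verify_integrity=False
    )
```

Slices of one MultiIndex keep the parent's levels, so partitions cut from a single frame are the case where this shortcut would be expected to apply.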