Performance of ufuncs

Raised by @tamasgal in https://github.com/scikit-hep/uproot4/issues/90#issuecomment-689459702:

Another problem, which I will post in a future issue, is that working with these arrays is far from NumPy performance (doing things like arr > 0.5 takes ~2 ms for 100 entries, while in numpy/Julia/C it should be closer to a few hundred ns), but that’s another story.

Original dataset: http://131.188.167.67:8889/doubly_jagged.root

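import uproot4
import awkward1 as ak

# Open the ROOT file and navigate to E/Evt/trks
trks = uproot4.open("uproot-issue-90.root:E/Evt/trks")
# Read the doubly jagged branch into an Awkward Array
array = trks["trks.rec_stages"].array()
# Save a Parquet copy of the array
ak.to_parquet(array, "awkward-issue-442.parquet")
array
# <Array [[[1, 2, 5, 3, 5, 4], ... 1], [1], [1]]] type='145028 * var * var * int64'>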

As Parquet (faster to download and read into Awkward): https://drive.google.com/file/d/1JbiFaBaouH_amUxvGnsSHegYAQjRTJ8u/view?usp=sharing

As Pickle (larger, but retains structure: Parquet adds option-type, which complicates the performance analysis): https://drive.google.com/file/d/1KnYebahkvLK29ZggISGROHxjcpCUdO0H/view?usp=sharing

The basic idea of performance in Awkward Array is that we don’t worry about the constant-time metadata manipulation but should worry about the linear-time scaling. In particular, computing array > 3 of a doubly jagged array is pure Python: it unwraps the doubly jagged structure and calls NumPy’s own np.greater on the inner flat content.
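The unwrapping idea can be sketched with plain NumPy. This is only an illustration, not Awkward Array's actual code: the offsets-plus-flat-content layout and the names content, inner_offsets, outer_offsets, jagged_greater and to_lists are assumptions made for the example. The point is that the comparison touches only the flat buffer, while the nesting metadata is reused unchanged.

```python
import numpy as np

# Hypothetical stand-in for a doubly jagged array: two levels of offsets
# wrapping one flat NumPy buffer. Illustration only, not Awkward's internals.
content = np.array([1, 2, 5, 3, 5, 4, 1, 1, 1], dtype=np.int64)
inner_offsets = np.array([0, 6, 7, 8, 9], dtype=np.int64)  # 4 inner lists
outer_offsets = np.array([0, 1, 3, 4], dtype=np.int64)     # 3 outer lists


def jagged_greater(flat, threshold):
    """Apply the comparison to the flat buffer only; offsets are reused as-is."""
    return np.greater(flat, threshold)


def to_lists(flat, inner, outer):
    """Rebuild nested Python lists, purely to inspect the result."""
    inner_lists = [
        flat[inner[i]:inner[i + 1]].tolist() for i in range(len(inner) - 1)
    ]
    return [inner_lists[outer[i]:outer[i + 1]] for i in range(len(outer) - 1)]


mask = jagged_greater(content, 3)
nested = to_lists(mask, inner_offsets, outer_offsets)
print(nested)
```

In this picture the only linear-time step is the single np.greater call, which is why the scaling would be expected to match NumPy's.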

Because of the constant-time unwrapping, it shouldn’t be surprising that the Awkward version doesn’t start scaling until the array is at least 1000 entries or so. What is surprising is that the linear scaling for Awkward doesn’t line up with the linear scaling for NumPy, because in theory, it isn’t doing anything other than calling NumPy.
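One way to picture the expectation is a small timing harness. It is a sketch under stated assumptions: with_fixed_overhead is a hypothetical stand-in for constant-time unwrapping (a fixed amount of Python work unrelated to array length), not a measurement of Awkward Array itself, and the sizes are arbitrary. If the overhead really is constant, the two columns should converge as the array grows. The surprise described above is that the measured Awkward curve does not.

```python
import timeit

import numpy as np


def per_call_seconds(func, size, repeats=20):
    """Best-of-N wall time for one call of func on an int64 array of `size`."""
    data = np.arange(size, dtype=np.int64)
    return min(timeit.repeat(lambda: func(data), number=1, repeat=repeats))


def numpy_only(data):
    return np.greater(data, 3)


def with_fixed_overhead(data):
    # Hypothetical stand-in for constant-time unwrapping: a fixed amount of
    # Python-level work that does not depend on len(data).
    for _ in range(200):
        pass
    return np.greater(data, 3)


sizes = [10, 1_000, 100_000, 1_000_000]
timings = {
    n: (per_call_seconds(numpy_only, n), per_call_seconds(with_fixed_overhead, n))
    for n in sizes
}
for n, (plain, wrapped) in timings.items():
    print(f"{n:>9} entries: numpy {plain:.2e} s, wrapped {wrapped:.2e} s")
```

Actual numbers depend on the machine, so none are quoted here. Running the same harness against the real array > 3 expression would be the way to see where the extra linear time appears.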

[Image: quick-plot]

So this is a quandary.

Top GitHub Comments

jpivarski commented, Sep 9, 2020

Looking at your issue, I don’t see how they’re related. They might be related. I need to do a quick profile of this because there’s a lot of linear time here that isn’t due to NumPy, and I can’t imagine what it might be.

tamasgal commented, Sep 9, 2020

Awesome, thanks for both the implementation and the fix!

…and of course for the detailed description of the whole problem