Removing ufunc-broadcasting across record fields

Currently, all ufuncs are broadcast across all fields of a record:

>>> ak_array = ak.Array([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}])
>>> ak_array
<Array [{x: 1, y: 1.1}, ... {x: 3, y: 3.3}] type='3 * {"x": int64, "y": float64}'>

>>> ak_array + 1
<Array [{x: 2, y: 2.1}, ... {x: 4, y: 4.3}] type='3 * {"x": int64, "y": float64}'>

This is causing some confusion because the fields of a record have qualitatively different meanings. Some are trigger booleans, some are momenta, some are ML-derived isolation variables, some are strings…

>>> ak.Array(["HAL"]) + 1                      # should this even work?
<Array [[73, 66, 77]] type='1 * var * uint8'>

>>> [chr(x) for x in (ak.Array(["HAL"]) + 1)[0].tolist()]
['I', 'B', 'M']

Furthermore, when @henryiii is writing vector, he has to distinguish between LorentzVector + LorentzVector accidentally working because they’re Cartesian (but not preserving their Lorentzness) and getting the wrong answer because they’re not Cartesian. Even though the + behavior is defined, the fact that they are records means he has to be sure to override every case.

I think there would be fewer surprises for both users and developers if broadcasting a ufunc through a record were an error (with a nice error message). Custom behaviors for specialized records, like LorentzVectors, would still be possible to define, as they are now, but instead of replacing wrong behavior, they’d be replacing no behavior.
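To make the proposal concrete, here is a minimal toy model in plain Python. It is not Awkward Array's implementation: the `Record` class, the `behavior` dictionary, `apply_binary`, and the field names `px`, `py`, `pz`, `E` are all hypothetical stand-ins. It only illustrates the idea that a record with no registered behavior raises an error, while a named record with a registered behavior uses it.

```python
# Toy model of the proposal; this is NOT how Awkward Array is implemented.
import operator

# Hypothetical registry: (operator, left record name, right record name) -> function
behavior = {}


class Record:
    """Minimal stand-in for a record: named fields plus an optional record name."""

    def __init__(self, fields, name=None):
        self.fields = dict(fields)
        self.name = name


def apply_binary(op, left, right):
    """Apply a binary operator, refusing to broadcast through records."""
    if isinstance(left, Record) or isinstance(right, Record):
        # Non-record operands (and unnamed records) contribute None to the key.
        key = (op, getattr(left, "name", None), getattr(right, "name", None))
        custom = behavior.get(key)
        if custom is None:
            raise TypeError(
                f"no {op.__name__} defined for records named {key[1]!r} and {key[2]!r}; "
                "operate on individual fields or register a custom behavior"
            )
        return custom(left, right)
    return op(left, right)


def add_cartesian(a, b):
    """Component-wise sum that keeps the record name on the result."""
    summed = {k: a.fields[k] + b.fields[k] for k in ("px", "py", "pz", "E")}
    return Record(summed, name="LorentzVector")


# A custom behavior now replaces "no behavior" rather than "wrong behavior".
behavior[operator.add, "LorentzVector", "LorentzVector"] = add_cartesian
```

With this in place, `apply_binary(operator.add, v, v)` works for two records named `"LorentzVector"`, while adding `1` to an unnamed record raises a `TypeError` instead of silently touching every field.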

Note that NumPy does not define such an operation on structured arrays:

>>> np_array = np.array([(1, 1.1), (2, 2.2), (3, 3.3)], [("x", int), ("y", float)])
>>> np_array
array([(1, 1.1), (2, 2.2), (3, 3.3)], dtype=[('x', '<i8'), ('y', '<f8')])

>>> np_array + 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid type promotion

Although it does work for Pandas:

>>> df = pd.DataFrame({"x": [1, 2, 3], "y": [1.1, 2.2, 3.3]})
>>> df
   x    y
0  1  1.1
1  2  2.2
2  3  3.3
>>> df + 1
   x    y
0  2  2.1
1  3  3.2
2  4  4.3

it is not our intention to generalize from Pandas, only NumPy.
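For comparison, this is roughly what NumPy users already do with structured arrays: select a field first, then apply the ufunc to the resulting ordinary array. The sketch below only uses standard NumPy indexing; the variable names are illustrative, and the exact wording of the error depends on the NumPy version.

```python
import numpy as np

np_array = np.array(
    [(1, 1.1), (2, 2.2), (3, 3.3)], dtype=[("x", int), ("y", float)]
)

# Whole-record arithmetic is rejected (the message text varies by NumPy version).
try:
    np_array + 1
except TypeError as err:
    whole_record_error = err
else:
    whole_record_error = None

# Selecting a field yields a plain array, on which ufuncs are well defined.
x_plus_one = np_array["x"] + 1
y_above_one = np_array["y"] > 1
```

Under the proposed change, Awkward records would follow the same pattern: the caller names the field the operation is meant for.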

This would also affect ufuncs that return booleans, like comparison operators. For these, the argument isn’t as strong. Maybe we want

>>> ak_array > 1
<Array [{x: False, y: True}, ... y: True}] type='3 * {"x": bool, "y": bool}'>

to work, maybe we don’t.

I’m considering removing ufuncs-through-records for all ufuncs, without affecting the custom ufunc behavior that can be assigned to any record with a name. (I’m not considering ufuncs-on-strings right now, though that’s something to think about.) Does anyone have a strong argument about that?

(I suppose this needs a deprecation cycle, though it would be a little difficult getting a warning into the middle of the broadcast-and-apply. I’m tempted to remove it all at once, like a band-aid…)
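As a rough illustration of what a deprecation cycle could look like, the toy broadcaster below warns at the point where it descends into a record and then carries on with the old field-by-field behavior. This is only a sketch under stated assumptions: `broadcast_add` is a hypothetical function over nested lists and dicts (dicts standing in for records), not Awkward's broadcast-and-apply machinery, and a columnar implementation would presumably warn once per array rather than once per record.

```python
import warnings


def broadcast_add(value, other):
    """Toy recursive broadcast: lists are arrays, dicts stand in for records."""
    if isinstance(value, list):
        return [broadcast_add(item, other) for item in value]
    if isinstance(value, dict):
        # Deprecation phase: warn, but keep the old field-by-field result.
        warnings.warn(
            "broadcasting a ufunc across record fields is deprecated; "
            "operate on individual fields instead",
            DeprecationWarning,
        )
        return {name: broadcast_add(field, other) for name, field in value.items()}
    return value + other
```

Removing the behavior "all at once" would amount to replacing the `warnings.warn` call and the dict comprehension with a raised `TypeError`.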

Issue Analytics

Created 3 years ago
Comments: 11 (11 by maintainers)

Top GitHub Comments

jpivarski commented, Oct 30, 2020

I’m calling it a bug because I’ve just defined the current behavior as wrong. (Even though I’ve presented it in talks.)

(A motivator for being short with these things is that the list of issues is a lot longer than I thought, and I have only until December 1 to make this awkward==1.0.0.)

lgray commented, Nov 3, 2020

Looks like the correct behavior to me. 😃