Should Awkward Arrays be usable as Pandas columns?
This was one of the design goals described in the original motivations document, but it has required some non-intuitive sorcery to implement and it’s not clear to me that it’s a valuable feature. To be clear, we’re talking about
>>> import awkward1 as ak
>>> import pandas as pd
>>> pd.DataFrame({"awkward": ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])})
           awkward
0  [1.1, 2.2, 3.3]
1               []
2       [4.4, 5.5]
and not
>>> ak.pandas.df(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
                values
entry subentry
0     0            1.1
      1            2.2
      2            3.3
2     0            4.4
      1            5.5
The explicit conversion into a MultiIndex DataFrame with ak.pandas.df has no issues: the implementation is straightforward and I know how I would use it—there are plenty of Pandas functions for dealing with MultiIndex. For example,
>>> df = ak.pandas.df(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
>>> df.unstack()
         values
subentry      0    1    2
entry
0           1.1  2.2  3.3
2           4.4  5.5  NaN
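The (entry, subentry) layout above can be illustrated without either library. The following is a minimal pure-Python sketch of the idea, not how ak.pandas.df is implemented; the helper names explode and unstack are hypothetical and exist only for this illustration.

```python
def explode(nested):
    """Flatten a list of variable-length lists into ((entry, subentry), value) rows."""
    return [
        ((entry, subentry), value)
        for entry, inner in enumerate(nested)
        for subentry, value in enumerate(inner)
    ]


def unstack(rows):
    """Pivot exploded rows into one padded row per non-empty entry (None marks a gap)."""
    width = max((subentry for (_, subentry), _ in rows), default=-1) + 1
    table = {}
    for (entry, subentry), value in rows:
        table.setdefault(entry, [None] * width)[subentry] = value
    return table


rows = explode([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
print(rows[0])        # ((0, 0), 1.1)
print(unstack(rows))  # {0: [1.1, 2.2, 3.3], 2: [4.4, 5.5, None]}
```

As in the DataFrame output, the empty list at entry 1 contributes no rows, so it disappears from the result.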
But for the Awkward-in-Pandas, the only things I know of that can be used directly are ufuncs:
>>> pd.DataFrame({"awkward": ak.Array([[1, 2, 3], [], [4, 5]])}) + 100
           awkward
0  [101, 102, 103]
1               []
2       [104, 105]
but not all ufuncs, for some Pandas reason:
>>> import numpy as np
>>> np.sqrt(pd.DataFrame({"awkward": ak.Array([[1, 2, 3], [], [4, 5]])}))
Traceback (most recent call last):
  File "/home/pivarski/irishep/awkward-1.0/awkward1/highlevel.py", line 996, in __getattr__
    raise AttributeError("no field named {0}".format(repr(where)))
AttributeError: no field named 'sqrt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: loop of ufunc does not support argument 0 of type Array which has no callable sqrt method
Presumably, we could narrow in on that reason and get it to work, but there are a lot of Pandas functions to test. The fundamental problem is that Awkward objects are “black boxes” to Pandas. Sure, we can put them in a DataFrame, but what’s Pandas going to do with them once they’re there?
There are other downsides to making Awkward Arrays subclasses of pandas.core.arrays.base.ExtensionArray (so that they can be columns). For one thing, it implies that we have to import pandas at startup, which can cost up to a second on slow machines or might try to import a broken installation of Pandas even if the user isn’t planning on using Pandas. (If Pandas is not installed, we can change the class hierarchy, but that means ak.Array behaves differently, depending on whether you’ve installed Pandas, even if you’re not using it.)
To avoid the above, the current implementation only makes ak.Array inherit from pandas.core.arrays.base.ExtensionArray if you try to use it in Pandas, which can be detected by a call to dtype. But for consistency, that’s even worse, since the inheritance of ak.Array now changes at runtime, depending on whether you’ve ever tried to use an Awkward Array in a DataFrame. This came up in a difference in behavior (reported on Slack) that I couldn’t reproduce at first because my test didn’t invoke Pandas. Namely, pandas.core.arrays.base.ExtensionArray defines some methods, and whether these methods exist on ak.Array depends on that history, unless they’re overridden by my own implementations. At the very least, I should override all the non-underscored ones so that their existence is not history-dependent, but that fills up the ak.Array namespace with names I don’t necessarily want:
- to_numpy: This would be fine; it would call ak.to_numpy, though the other methods don’t have an underscore, such as tolist (for consistency with NumPy).
- dtype: Already tricky, since Pandas requires a new one, AwkwardDType, and Dask requires np.dtype("O").
- shape: Pandas needs this to be one-dimensional, which is misleading for an Awkward Array. Preferably, Awkward Arrays would have no shape at all; the combined dtype and shape can only be fully captured by ak.type.
- ndim: Much like shape, it’s misleading for this to always be 1.
- nbytes: This is fine, and other libraries expect such a property, too.
- astype: This was the surprise that triggered this issue: I didn’t think Awkward Arrays had an astype, since it’s not clear what it should mean. For changing numeric types, there’s an open PR #346, but it’s a new function since it doesn’t change the whole type of the array; it descends to the leaves where the numbers are.
- isna: This can go to ak.is_none, though “na” is not how we refer to missing data.
- argsort: This can go to ak.argsort.
- fillna: This can go to ak.fill_none, but see the note on isna above.
- dropna: We don’t have an ak.drop_none, but such a thing wouldn’t be too hard to write.
- shift: This one only makes sense for rectangular tables. (See the definition.)
- unique: We don’t have an ak.unique and there could be some subtleties there. We don’t have a definition for record equality, for example, and string equality is already handled through a behavioral extension.
- searchsorted: Only makes sense if the data are actually sorted. Should there be an axis=1 version of this for variable-length lists? Usually, physics events are unsorted but the particles (axis=1) are sorted by pT.
- factorize: This is a non-intuitive name, but it could be good to have an Awkward function that turns arrays into an IndexedArray of unique values. But for complex objects like records, this brings up the same issues as unique (above).
- repeat: We don’t have an ak.repeat, but that might be useful in some contexts. I usually find np.repeat and np.tile to be a pair that have to be used together, usually to make a Cartesian product (and we already have ak.cartesian).
- take: This seems unnecessary to me, since we already have __getitem__ with integer arrays.
- copy: I don’t know if we have a high-level “copy” function, but we have the low-level ones to link it up.
- view: This wouldn’t make much sense for an Awkward Array. It’s not a simple buffer.
- ravel: Maybe the equivalent of this is ak.flatten? Flattening variable-length arrays, particularly ones that include records, is a different kind of thing from flattening rectilinear data.
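Two of the ideas in this list, converting numbers at the leaves (the astype note) and splitting an array into an index plus unique values (the factorize note), can be sketched on plain Python lists. The names map_leaves and factorize below are hypothetical helpers for illustration, not Awkward Array functions, and this factorize only handles hashable values, which sidesteps the record-equality question raised under unique.

```python
def map_leaves(data, convert):
    """Apply convert to every non-list value, keeping the nesting structure."""
    if isinstance(data, list):
        return [map_leaves(item, convert) for item in data]
    return convert(data)


def factorize(values):
    """Return (index, uniques) such that [uniques[i] for i in index] == values."""
    positions = {}
    index = []
    for value in values:
        # setdefault assigns the next free position only for unseen values
        index.append(positions.setdefault(value, len(positions)))
    return index, list(positions)


print(map_leaves([[1, 2, 3], [], [4, 5]], float))  # [[1.0, 2.0, 3.0], [], [4.0, 5.0]]
print(factorize(["mu", "e", "mu", "tau", "e"]))    # ([0, 1, 0, 2, 1], ['mu', 'e', 'tau'])
```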
Given these mismatches, I’m strongly considering removing the Awkward-in-Pandas feature before Awkward1 actually becomes 1.0. The explicit conversion functions, ak.pandas.df and ak.pandas.dfs, would be kept.
But I might be wrong: there might be some fantastic use-case for Awkward-in-Pandas that I don’t know about. This question is an informal vote on the feature. You might have been sent here by an error message, where the feature is provisionally removed with a way to opt in. If you find it useful to include Awkward Arrays inside of Pandas DataFrames (distinct from the ak.pandas.df conversion), then say so here, describing the use-case. You can opt in now by calling ak.pandas.register(), but if I don’t hear from people saying that they really use it, the feature will be removed and you won’t be able to use it past 1.0.
So let me know!

The specific issue of np.sqrt(dataframe) failing for DataFrames with extension arrays comes down to DataFrame not defining __array_ufunc__ yet. That’s a known issue: https://github.com/pandas-dev/pandas/issues/23743 (I don’t think anyone is working on it at the moment).
But to your next point, that Awkward objects are “black boxes” to Pandas: that is the essential motivation for ExtensionArrays, a way for pandas and these black boxes of arrays to interact through a well-defined interface. For example, cyberpandas provides vectorized implementations of ipaddress operations to pandas. pandas doesn’t need to know about the memory layout of cyberpandas (a 2D int64 ndarray) or any IP operations for this to work.
Now, the interface is relatively young. Some things work and some things (as you’ve discovered) don’t. But it is improving with each release.
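To make the __array_ufunc__ point concrete, here is a small sketch that uses only NumPy. Opaque and Boxed are hypothetical classes, not part of pandas or Awkward Array; the sketch shows how NumPy treats an object with and without the protocol, and it deliberately ignores details such as the out= argument.

```python
import numpy as np


class Opaque:
    """A black box: NumPy wraps it in an object array and looks for a .sqrt() method."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype=float)


class Boxed(Opaque):
    """Same data, but it tells NumPy how to apply ufuncs to it."""

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__":
            return NotImplemented
        # Unwrap our own objects, apply the ufunc to the raw buffers, and rewrap.
        unwrapped = [x.values if isinstance(x, Opaque) else x for x in inputs]
        return Boxed(ufunc(*unwrapped, **kwargs))


try:
    np.sqrt(Opaque([1.0, 4.0, 9.0]))
except TypeError as err:
    print("TypeError:", err)

print(np.sqrt(Boxed([1.0, 4.0, 9.0])).values)  # [1. 2. 3.]
```

The failure for Opaque has the same shape as the traceback in the question: without the protocol, NumPy falls back to looking for a method named after the ufunc on the object itself.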
I personally wouldn’t recommend making general-purpose objects like AwkwardArray try to implement pandas’ Extension Array interface. As you note, there are some public methods that might clash with implementations in AwkwardArray. And I’ve never had good experiences making base classes dynamic. I’d instead recommend a dedicated object that implements the interface.
This raises some issues around putting AwkwardArray objects into a pandas DataFrame, if AwkwardArray doesn’t implement the interface. I’m sure the pandas maintainers would be happy to discuss options there (like a __pandas_extension_array__ interface that objects can implement to return a pandas extension-array-compatible object). That would ensure that pd.DataFrame({"A": my_awkward_array}) keeps the data as an awkward array, rather than copying to an object-dtype ndarray.
As a general point though, the extension array interface is still evolving. If you run into issues please do speak up, either here or on the pandas issue tracker!
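The “dedicated object” recommendation could look roughly like the following pure-Python sketch. Everything here is hypothetical: ColumnAdapter, RaggedArray and as_column are illustration names, and __pandas_extension_array__ is only the hook proposed in this comment, not something pandas is known to call.

```python
class ColumnAdapter:
    """Hypothetical dedicated object that would implement the column interface."""

    def __init__(self, array):
        self.array = array  # keeps a reference; no copy, no conversion

    def __len__(self):
        return len(self.array.data)

    def isna(self):
        return [item is None for item in self.array.data]


class RaggedArray:
    """Stand-in for a general-purpose array type with its own public namespace."""

    def __init__(self, data):
        self.data = data

    def __pandas_extension_array__(self):
        # Proposed hook only; this is an assumption, not an existing pandas protocol.
        return ColumnAdapter(self)


def as_column(obj):
    """What a DataFrame constructor could do if such a protocol existed."""
    hook = getattr(obj, "__pandas_extension_array__", None)
    return hook() if hook is not None else list(obj)


ragged = RaggedArray([[1.1, 2.2], None, []])
column = as_column(ragged)
print(column.isna())            # [False, True, False]
print(hasattr(ragged, "isna"))  # False: the array's own namespace stays clean
```

The design choice is that the Pandas-specific method names live on the adapter, so the array class neither changes its base classes at runtime nor picks up names like isna or astype.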
Awkward arrays as Pandas columns will be deprecated.
The next release will present a deprecation warning when you try to use an Awkward array in Pandas (as a Series or a DataFrame column) and it will be removed in 0.3.0.
The ak.pandas.df and ak.pandas.dfs functions will be combined and renamed as ak.to_pandas for consistency. The new function name already exists and the old ones will be removed in 0.3.0.
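A rename with a deprecation period is usually handled by keeping the old name as a thin wrapper that warns and forwards. The sketch below shows that general pattern with the standard warnings module; it is not the Awkward Array source, the function bodies are placeholders, and the dictionary return value merely stands in for a real DataFrame.

```python
import warnings


def to_pandas(array):
    """Placeholder for the new conversion function (real body elided in this sketch)."""
    return {"values": list(array)}


def df(array):
    """Old name kept as a thin wrapper that warns and forwards to the new one."""
    warnings.warn(
        "df is deprecated and will be removed in 0.3.0; use to_pandas instead",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this wrapper
    )
    return to_pandas(array)
```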