Should Awkward Arrays be usable as Pandas columns?


This was one of the design goals described in the original motivations document, but it has required some non-intuitive sorcery to implement and it’s not clear to me that it’s a valuable feature. To be clear, we’re talking about

>>> import awkward1 as ak
>>> import numpy as np
>>> import pandas as pd
>>> pd.DataFrame({"awkward": ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])})
           awkward
0  [1.1, 2.2, 3.3]
1               []
2       [4.4, 5.5]

and not

>>> ak.pandas.df(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
                values
entry subentry        
0     0            1.1
      1            2.2
      2            3.3
2     0            4.4
      1            5.5

The explicit conversion into a MultiIndex DataFrame with ak.pandas.df has no issues: the implementation is straightforward and I know how I would use it—there are plenty of Pandas functions for dealing with MultiIndex. For example,

>>> df = ak.pandas.df(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
>>> df.unstack()
         values          
subentry      0    1    2
entry                    
0           1.1  2.2  3.3
2           4.4  5.5  NaN
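As one illustration of the MultiIndex tooling mentioned above, here is a minimal sketch that rebuilds the same table with plain Pandas and queries it per entry. The hand-built index is an assumption standing in for the output of ak.pandas.df, so the example does not depend on awkward1.

```python
import pandas as pd

# Hand-built stand-in for the ak.pandas.df output shown above
# (entry 1 is empty, so it contributes no rows).
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (2, 0), (2, 1)],
    names=["entry", "subentry"],
)
df = pd.DataFrame({"values": [1.1, 2.2, 3.3, 4.4, 5.5]}, index=index)

# Number of items in each non-empty list.
lengths = df.groupby(level="entry").size()

# Largest value in each list.
largest = df.groupby(level="entry")["values"].max()

# First item of each list: select subentry 0 across all entries.
first_items = df.xs(0, level="subentry")
```

Each of these is an ordinary Pandas operation on index levels, which is why the explicit conversion is easy to work with once the data are in a DataFrame.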

But for the Awkward-in-Pandas, the only things I know of that can be used directly are ufuncs:

>>> pd.DataFrame({"awkward": ak.Array([[1, 2, 3], [], [4, 5]])}) + 100
           awkward
0  [101, 102, 103]
1               []
2       [104, 105]

but not all ufuncs, for some Pandas reason:

>>> np.sqrt(pd.DataFrame({"awkward": ak.Array([[1, 2, 3], [], [4, 5]])}))
Traceback (most recent call last):
  File "/home/pivarski/irishep/awkward-1.0/awkward1/highlevel.py", line 996, in __getattr__
    raise AttributeError("no field named {0}".format(repr(where)))
AttributeError: no field named 'sqrt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: loop of ufunc does not support argument 0 of type Array which has no callable sqrt method

Presumably, we could narrow in on that reason and get it to work, but there are a lot of Pandas functions to test. The fundamental problem is that Awkward objects are “black boxes” to Pandas. Sure, we can put them in a DataFrame, but what’s Pandas going to do with them once they’re there?
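The wording of that TypeError is what NumPy raises when an object-dtype ufunc loop looks for a method named after the ufunc on each element and does not find one. That suggests the column reached np.sqrt as a plain array of objects. The sketch below reproduces the message with a toy class and contrasts it with a container that opts in through __array_ufunc__. Opaque and Wrapped are hypothetical names, not part of Awkward Array or Pandas.

```python
import numpy as np


class Opaque:
    """An object NumPy knows nothing about (stand-in for a black box)."""


class Wrapped:
    """Toy container that opts in to ufuncs via __array_ufunc__."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype=float)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__":
            return NotImplemented
        # Unwrap our own type, apply the ufunc to plain arrays, rewrap.
        raw = [x.values if isinstance(x, Wrapped) else x for x in inputs]
        return Wrapped(ufunc(*raw, **kwargs))


# Without the protocol, the object-dtype loop looks for an .sqrt() method.
try:
    np.sqrt(np.array([Opaque()], dtype=object))
    message = None
except TypeError as err:
    message = str(err)

# With the protocol, NumPy hands the call to the container itself.
result = np.sqrt(Wrapped([1.0, 4.0, 9.0]))
```

Here message holds the "no callable sqrt method" text, while result is a Wrapped whose values are the square roots.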

There are other downsides to making Awkward Arrays subclasses of pandas.core.arrays.base.ExtensionArray (so that they can be columns). For one thing, it implies that we have to import pandas at startup, which can cost up to a second on slow machines or might try to import a broken installation of Pandas even if the user isn’t planning on using Pandas. (If Pandas is not installed, we can change the class hierarchy, but that means ak.Array behaves differently, depending on whether you’ve installed Pandas, even if you’re not using it.)
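For comparison, a function-level dependency can be deferred until first use, which is roughly how an explicit conversion can avoid the startup cost. The following is a minimal sketch under that assumption; lazy_import and to_dataframe are hypothetical helpers, not Awkward Array functions.

```python
import importlib

_module_cache = {}


def lazy_import(name):
    """Import `name` on first use and cache it; raise a readable ImportError otherwise."""
    if name not in _module_cache:
        try:
            _module_cache[name] = importlib.import_module(name)
        except ImportError as err:
            raise ImportError(
                f"{name} is required for this conversion but could not be imported"
            ) from err
    return _module_cache[name]


def to_dataframe(columns):
    """Hypothetical explicit conversion: pandas is only imported when this runs."""
    pd = lazy_import("pandas")
    return pd.DataFrame(columns)
```

A base class cannot be deferred this way, because the class statement needs the parent class to exist when the module is first imported. That is the constraint behind the startup cost described above.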

To avoid the above, the current implementation only makes ak.Array inherit from pandas.core.arrays.base.ExtensionArray if you try to use it in Pandas, which can be detected by a call to dtype. But for consistency, that’s even worse, since the inheritance of ak.Array now changes at runtime, depending on whether you’ve ever tried to use an Awkward Array in a DataFrame. This came up in a difference in behavior (reported on Slack) that I couldn’t reproduce at first because my test didn’t invoke Pandas. Namely, pandas.core.arrays.base.ExtensionArray defines some methods, and whether these methods exist on ak.Array depends on that history, unless they’re overshadowed by my own implementations. At the very least, I should overshadow all the non-underscored ones so that their existence is not history-dependent, but that fills up the ak.Array namespace with names I don’t necessarily want.
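The history dependence can be reproduced with a toy class. This is only a sketch of the general effect, not Awkward Array's actual mechanism: FakeExtensionArray, _Root and ToyArray are hypothetical names, and the base class is attached by reassigning __bases__ the first time dtype is read.

```python
class FakeExtensionArray:
    """Stand-in for pandas.core.arrays.base.ExtensionArray (hypothetical)."""

    def astype(self, dtype):
        return f"astype({dtype}) inherited from FakeExtensionArray"


class _Root:
    """Empty base class; CPython typically rejects __bases__ assignment
    on classes that inherit directly from object."""


class ToyArray(_Root):
    def __init__(self, data):
        self.data = data

    @property
    def dtype(self):
        # First access attaches the extra base class to the whole class,
        # so every instance changes behavior from this point on.
        cls = type(self)
        if FakeExtensionArray not in cls.__bases__:
            cls.__bases__ = cls.__bases__ + (FakeExtensionArray,)
        return "toy"


arr = ToyArray([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
before = hasattr(arr, "astype")  # no astype yet
arr.dtype                        # something asks for the dtype
after = hasattr(arr, "astype")   # astype has appeared
```

The same object answers hasattr differently before and after the dtype access, which is the kind of difference that is hard to reproduce in a test that never touches Pandas.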

  • to_numpy: This would be fine; it would call ak.to_numpy, though the other conversion methods, such as tolist, have no underscore (for consistency with NumPy).
  • dtype: Already tricky, since Pandas requires a new one, AwkwardDType, and Dask requires np.dtype("O").
  • shape: Pandas needs this to be one-dimensional, which is misleading for an Awkward Array. Preferably, Awkward Arrays would have no shape at all; the combined dtype and shape can only be fully captured by ak.type.
  • ndim: Much like shape, it’s misleading for this to always be 1.
  • nbytes: This is fine, and other libraries expect such a property, too.
  • astype: This was the surprise that triggered this issue: I didn’t think Awkward Arrays had an astype, since it’s not clear what it should mean. For changing numeric types, there’s an open PR #346, but it’s a new function because it doesn’t change the whole type of the array; it descends to the leaves where the numbers are.
  • isna: This can go to ak.is_none, though “na” is not how we refer to missing data.
  • argsort: This can go to ak.argsort.
  • fillna: This can go to ak.fill_none, but see the note on isna above.
  • dropna: We don’t have an ak.drop_none, but such a thing wouldn’t be too hard to write.
  • shift: This one only makes sense for rectangular tables. (See the definition.)
  • unique: We don’t have an ak.unique and there could be some subtleties there. We don’t have a definition for record equality, for example, and string equality is already handled through a behavioral extension.
  • searchsorted: Only makes sense if the data are actually sorted. Should there be an axis=1 version of this for variable-length lists? Usually, physics events are unsorted but the particles (axis=1) are sorted by pT.
  • factorize: This is a non-intuitive name, but it could be good to have an Awkward function that turns arrays into an IndexedArray of unique values. But for complex objects like records, this brings up the same issues as unique (above).
  • repeat: We don’t have an ak.repeat, but that might be useful in some contexts. I usually find np.repeat and np.tile to be a pair that have to be used together, usually to make a Cartesian product (and we already have ak.cartesian).
  • take: This seems unnecessary to me, since we already have __getitem__ with integer arrays.
  • copy: I don’t know if we have a high-level “copy” function, but we have the low-level ones to link it up.
  • view: This wouldn’t make much sense for an Awkward Array. It’s not a simple buffer.
  • ravel: Maybe the equivalent of this is ak.flatten? Flattening variable-length arrays, particularly ones that include records, is a different kind of thing from flattening rectilinear data.
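To see which names would need overshadowing in a given installation, the public attributes of the base class can be listed directly. This is a minimal sketch, assuming Pandas is installed and exposes ExtensionArray through pandas.api.extensions; the exact list varies between Pandas versions, so no output is shown.

```python
from pandas.api.extensions import ExtensionArray

# Every non-underscored attribute that a subclass inherits
# unless it provides its own implementation.
public_names = sorted(
    name for name in dir(ExtensionArray) if not name.startswith("_")
)

for name in public_names:
    print(name)
```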

Given these mismatches, I’m strongly considering removing the Awkward-in-Pandas feature before Awkward1 actually becomes 1.0. The explicit conversion functions, ak.pandas.df and ak.pandas.dfs, would be kept.

But I might be wrong: there might be some fantastic use-case for Awkward-in-Pandas that I don’t know about. This question is an informal vote on the feature. You might have been sent here by an error message, where the feature is provisionally removed with a way to opt in. If you find it useful to include Awkward Arrays inside of Pandas DataFrames (distinct from the ak.pandas.df conversion), then say so here, describing the use-case. You can opt in now by calling ak.pandas.register(), but if I don’t hear from people saying that they really use it, the feature will be removed and you won’t be able to use it past 1.0.

So let me know!

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

3 reactions
TomAugspurger commented, Oct 5, 2020

> But for the Awkward-in-Pandas, the only things I know of that can be used directly are ufuncs: […] but not all ufuncs, for some Pandas reason:

The specific issue of np.sqrt(dataframe) failing for DataFrames with extension arrays comes down to DataFrame not defining __array_ufunc__ yet. That’s a known issue: https://github.com/pandas-dev/pandas/issues/23743 (I don’t think anyone is working on it at the moment). But to your next point:

> The fundamental problem is that Awkward objects are “black boxes” to Pandas. Sure, we can put them in a DataFrame, but what’s Pandas going to do with them once they’re there?

That’s the essential motivation for ExtensionArrays: a way for pandas and these black boxes of arrays to interact through a well-defined interface. For example, cyberpandas provides vectorized implementations of ipaddress operations to pandas. pandas doesn’t need to know about the memory layout of cyberpandas (a 2D int64 ndarray) or any IP operations for this to work.

Now, the interface is relatively young. Some things work and some things (as you’ve discovered) don’t. But it is improving with each release.

> There are other downsides to making Awkward Arrays subclasses of pandas.core.arrays.base.ExtensionArray

I personally wouldn’t recommend making general-purpose objects like AwkwardArray try to implement pandas’ Extension Array interface. As you note, there are some public methods that might clash with implementations in AwkwardArray. And I’ve never had good experiences making base classes dynamic. I’d instead recommend a dedicated object that implements the interface.

This raises some issues around putting AwkwardArray objects into a pandas DataFrame, if AwkwardArray doesn’t implement the interface. I’m sure the pandas maintainers would be happy to discuss options there (like a __pandas_extension_array__ interface that objects can implement to return a pandas extension-array-compatible object). That would ensure that pd.DataFrame({"A": my_awkward_array}) keeps the data as an awkward array, rather than copying to an object-dtype ndarray.
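The two suggestions above, a dedicated object that carries the interface and a hook that hands it out, can be sketched in plain Python. Everything here is hypothetical: __pandas_extension_array__ is only the name floated in this comment and is not an existing pandas protocol, and ToyRagged, ToyRaggedAdapter and as_column are illustrative stand-ins.

```python
class ToyRagged:
    """General-purpose array; its namespace carries no column-style methods."""

    def __init__(self, lists):
        self.lists = [list(x) for x in lists]

    def __len__(self):
        return len(self.lists)

    def __pandas_extension_array__(self):
        # Hypothetical hook: return an object that implements the interface.
        return ToyRaggedAdapter(self)


class ToyRaggedAdapter:
    """Dedicated object that holds the column-style interface."""

    def __init__(self, array):
        self.array = array  # keeps a reference; the data are not copied

    def __len__(self):
        return len(self.array)

    @property
    def shape(self):
        # One-dimensional, as a column needs, without touching ToyRagged.
        return (len(self.array),)

    def take(self, indices):
        return ToyRaggedAdapter(ToyRagged(self.array.lists[i] for i in indices))


def as_column(obj):
    """Hypothetical consumer-side lookup, as a DataFrame constructor might do."""
    hook = getattr(obj, "__pandas_extension_array__", None)
    return hook() if hook is not None else list(obj)


ragged = ToyRagged([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
column = as_column(ragged)
```

With this split, column.shape is (3,) while ragged itself has no shape attribute, and no base class changes at runtime.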

As a general point though, the extension array interface is still evolving. If you run into issues please do speak up, either here or on the pandas issue tracker!

1 reaction
jpivarski commented, Aug 6, 2020

Awkward arrays as Pandas columns will be deprecated.

The next release will present a deprecation warning when you try to use an Awkward array in Pandas (as a Series or a DataFrame column) and it will be removed in 0.3.0.

The ak.pandas.df and ak.pandas.dfs functions will be combined and renamed as ak.to_pandas for consistency. The new function name already exists and the old ones will be removed in 0.3.0.
