refactor: unify default array representation in Pandas and Dask backends
#2619 will cover making the Pandas and Dask backends accept either np.arrays or lists (in other words, be able to execute operations on array expressions regardless of whether they are represented by lists or by np.arrays).
However, the Ibis API has several functions that can create array expressions for the user. Unfortunately, after #2615, the Pandas and Dask backends are no longer consistent in what kind of arrays they create: they may output either lists or np.arrays, depending on the part of the API.
API | Operation | Creates | Represented by |
---|---|---|---|
ibis.literal([1, 2, 3]) | Literal | ArrayScalar | list |
ibis.literal(np.array([1, 2, 3])) | Literal | ArrayScalar | np.array |
ibis.array([1, 2, 3]) | Literal | ArrayScalar | list |
ibis.array(np.array([1, 2, 3])) | Literal | ArrayScalar | list |
ibis.array([t.int_col, t.other_int_col]) | ArrayColumn | ArrayColumn | np.arrays (elements) |
arr_scalar[1:3] | ArraySlice | ArrayScalar | list |
t.arr_col[1:3] | ArraySlice | ArrayColumn | list (elements) |
arr_scalar + other_arr_scalar | ArrayConcat | ArrayScalar | list |
t.arr_col + arr_scalar | ArrayConcat | ArrayColumn | 💥 (not working) |
t.arr_col + t.other_arr_col | ArrayConcat | ArrayColumn | list (elements) |
arr_scalar * 5 | ArrayRepeat | ArrayScalar | list |
t.arr_col * 5 | ArrayRepeat | ArrayColumn | list (elements) |
t.int_col.collect() | ArrayCollect | ArrayScalar | list |
t.int_col.quantile([0.25, 0.50, 0.75]) | MultiQuantile | ArrayScalar | list |
map_scalar.values() | MapValues | ArrayScalar | list |
t.map_col.values() | MapValues | ArrayColumn | list (elements) |
map_scalar.keys() | MapKeys | ArrayScalar | list |
t.map_col.keys() | MapKeys | ArrayColumn | list (elements) |
t.str_col.split(',') | StringSplit | ArrayColumn | list (elements) |
elem_udf_list(t.int_col) | ElementWiseVectorizedUDF | ArrayColumn | list (elements) |
elem_udf_ndarray(t.int_col) | ElementWiseVectorizedUDF | ArrayColumn | np.arrays (elements) |
analytic_udf_list(t.int_col) | AnalyticVectorizedUDF | ArrayColumn | list (elements) |
analytic_udf_ndarray(t.int_col) | AnalyticVectorizedUDF | ArrayColumn | np.arrays (elements) |
reduc_udf_list(t.int_col) | ReductionVectorizedUDF | ArrayScalar | list |
We should make the return values for all of these operations consistent (otherwise it’s hard for the user to predict what their arrays in the final executed DataFrame will look like), but I wanted to get some feedback on whether we want the default representation to be list or np.array.
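To make the inconsistency concrete, here is a minimal sketch in plain pandas and NumPy (no Ibis calls) of a result column whose elements mix both representations, along with one way it could be coerced to a single representation. The helper `normalize_array_column` and its `as_ndarray` parameter are hypothetical names used only for illustration; they are not part of Ibis, and this is not a proposal for how the backends should implement the fix.

```python
import numpy as np
import pandas as pd


def normalize_array_column(series, as_ndarray=True):
    """Hypothetical helper (not part of Ibis): coerce every element of an
    object-dtype Series to a single array representation."""

    def convert(value):
        if value is None:
            # Leave missing values untouched.
            return None
        if as_ndarray:
            return np.asarray(value)
        # ndarray.tolist() yields plain Python scalars; list() would not.
        return value.tolist() if isinstance(value, np.ndarray) else list(value)

    return pd.Series(
        [convert(v) for v in series], index=series.index, dtype=object
    )


# A column whose elements mix lists and np.arrays, similar to what a user
# might see when different parts of the API produce the arrays.
mixed = pd.Series([[1, 2, 3], np.array([4, 5]), None], dtype=object)

as_arrays = normalize_array_column(mixed, as_ndarray=True)
as_lists = normalize_array_column(mixed, as_ndarray=False)
```

Whichever default is chosen, a single coercion point like this would let users rely on one element type in the executed DataFrame.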
I think arrays as np.arrays would be useful and I don’t see much reason to prefer lists over np.arrays. However, I think most other backends represent arrays using lists (in the Pandas DataFrames resulting from execution).
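One practical difference worth keeping in mind when choosing the default, sketched here in plain Python and NumPy rather than Ibis internals: the `+` and `*` operators, which the table maps to ArrayConcat and ArrayRepeat, mean different things for the two representations. The variable names are illustrative only, and the snippet makes no claim about how the backends currently execute these operations.

```python
import numpy as np

left_list, right_list = [1, 2, 3], [4, 5, 6]
left_arr, right_arr = np.array(left_list), np.array(right_list)

# Python lists: `+` concatenates and `* n` repeats, which matches the
# semantics of ArrayConcat and ArrayRepeat.
list_concat = left_list + right_list
list_repeat = left_list * 2

# NumPy arrays: the same operators perform elementwise arithmetic.
arr_sum = left_arr + right_arr
arr_scaled = left_arr * 2

# With np.arrays, concatenation and repetition need explicit calls.
arr_concat = np.concatenate([left_arr, right_arr])
arr_repeat = np.tile(left_arr, 2)
```

So if np.array becomes the default, any execution code that relies on the Python operators for concatenation or repetition would need to use explicit calls like `np.concatenate` and `np.tile` instead.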
Top GitHub Comments
Related: https://github.com/ibis-project/ibis/issues/2377 has some thoughts on supporting lists of structs via Arrow.
@timothydijamco Thanks for the great issue write-up. It really warms my heart to get issues with descriptions like this. It makes maintainers' lives really easy.