refactor: unify default array representation in Pandas and Dask backends

#2619 will cover making the Pandas and Dask backends accept either np.arrays or lists (in other words, be able to execute operations on array expressions, regardless of whether they are represented by lists or by np.arrays).
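As a rough illustration of what "accept either" could look like, here is a minimal sketch in plain NumPy. The helper names (`as_ndarray`, `array_concat`, `array_repeat`) are hypothetical and are not part of the Ibis API or of the actual backend code; the sketch only assumes that inputs are flat lists or 1-D np.arrays.

```python
import numpy as np


def as_ndarray(value):
    """Hypothetical helper: coerce a list or np.ndarray to np.ndarray."""
    # np.asarray returns an existing ndarray unchanged and converts a list.
    return np.asarray(value)


def array_concat(left, right):
    """Concatenate two array values, whatever their representation."""
    return np.concatenate([as_ndarray(left), as_ndarray(right)])


def array_repeat(arr, n):
    """Repeat an array value n times, whatever its representation."""
    return np.tile(as_ndarray(arr), n)


# Mixed inputs (a list and an np.array) go through the same code path.
mixed = array_concat([1, 2], np.array([3]))
repeated = array_repeat([1, 2], 2)
```

Normalizing at the boundary like this keeps each operation's implementation independent of how the input happened to be represented; whether the backends do it this way is not stated in the issue.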

However, the Ibis API has several functions that can create array expressions for the user, and unfortunately, after #2615, the Pandas and Dask backends are no longer consistent in the kind of arrays they create: they may output either lists or np.arrays, depending on the part of the API.

| API | Operation | Creates | Represented by |
| --- | --- | --- | --- |
| `ibis.literal([1, 2, 3])` | `Literal` | `ArrayScalar` | `list` |
| `ibis.literal(np.array([1, 2, 3]))` | `Literal` | `ArrayScalar` | `np.array` |
| `ibis.array([1, 2, 3])` | `Literal` | `ArrayScalar` | `list` |
| `ibis.array(np.array([1, 2, 3]))` | `Literal` | `ArrayScalar` | `list` |
| `ibis.array([t.int_col, t.other_int_col])` | `ArrayColumn` | `ArrayColumn` | `np.arrays` (elements) |
| `arr_scalar[1:3]` | `ArraySlice` | `ArrayScalar` | `list` |
| `t.arr_col[1:3]` | `ArraySlice` | `ArrayColumn` | `list` (elements) |
| `arr_scalar + other_arr_scalar` | `ArrayConcat` | `ArrayScalar` | `list` |
| `t.arr_col + arr_scalar` | `ArrayConcat` | `ArrayColumn` | 💥 (not working) |
| `t.arr_col + t.other_arr_col` | `ArrayConcat` | `ArrayColumn` | `list` (elements) |
| `arr_scalar * 5` | `ArrayRepeat` | `ArrayScalar` | `list` |
| `t.arr_col * 5` | `ArrayRepeat` | `ArrayColumn` | `list` (elements) |
| `t.int_col.collect()` | `ArrayCollect` | `ArrayScalar` | `list` |
| `t.int_col.quantile([0.25, 0.50, 0.75])` | `MultiQuantile` | `ArrayScalar` | `list` |
| `map_scalar.values()` | `MapValues` | `ArrayScalar` | `list` |
| `t.map_col.values()` | `MapValues` | `ArrayColumn` | `list` (elements) |
| `map_scalar.keys()` | `MapKeys` | `ArrayScalar` | `list` |
| `t.map_col.keys()` | `MapKeys` | `ArrayColumn` | `list` (elements) |
| `t.str_col.split(',')` | `StringSplit` | `ArrayColumn` | `list` (elements) |
| `elem_udf_list(t.int_col)` | `ElementWiseVectorizedUDF` | `ArrayColumn` | `list` (elements) |
| `elem_udf_ndarray(t.int_col)` | `ElementWiseVectorizedUDF` | `ArrayColumn` | `np.arrays` (elements) |
| `analytic_udf_list(t.int_col)` | `AnalyticVectorizedUDF` | `ArrayColumn` | `list` (elements) |
| `analytic_udf_ndarray(t.int_col)` | `AnalyticVectorizedUDF` | `ArrayColumn` | `np.arrays` (elements) |
| `reduc_udf_list(t.int_col)` | `ReductionVectorizedUDF` | `ArrayScalar` | `list` |

We should make the return values for all of these operations consistent (otherwise it’s hard for the user to predict what the arrays in the final executed DataFrame will look like), but I wanted to get some feedback on whether we want the default representation to be list or np.array.

I think representing arrays as np.arrays would be useful, and I don’t see much reason to prefer lists over np.arrays. However, I think most other backends represent arrays using lists (in the Pandas DataFrames resulting from execution).
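One practical reason the representation is visible to users: once a value has been pulled out of an executed result, Python lists and np.arrays respond differently to the same operators. The sketch below uses only plain Python and NumPy semantics, not Ibis or backend code, and the variable names are illustrative.

```python
import numpy as np

as_list = [1, 2, 3]
as_ndarray = np.array([1, 2, 3])

# `+` concatenates lists but adds np.arrays element by element.
list_plus = as_list + as_list
ndarray_plus = as_ndarray + as_ndarray

# `*` repeats a list but multiplies each element of an np.array.
list_times = as_list * 2
ndarray_times = as_ndarray * 2

# `==` gives a single bool for lists and a boolean array for np.arrays.
list_eq = as_list == [1, 2, 3]
ndarray_eq = as_ndarray == np.array([1, 2, 3])
```

Whichever default is chosen, code that post-processes executed results would probably need to be written against that one representation, which is part of why mixing the two is hard to predict.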

Top GitHub Comments

tswast commented, Feb 23, 2021

Related: https://github.com/ibis-project/ibis/issues/2377 has some thoughts on supporting lists of structs via Arrow.

cpcloud commented, Jan 11, 2022

@timothydijamco Thanks for the great issue write-up. It really warms my heart to get issues with descriptions like this. It makes maintainers' lives really easy.
