refactor: unify default array representation in Pandas and Dask backends

#2619 will cover making the Pandas and Dask backends accept either np.arrays or lists (in other words, be able to execute operations on array expressions, regardless of whether they are represented by lists or by np.arrays).
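As a rough illustration of what "accepting either representation" could look like, here is a minimal sketch. The helpers `to_list`, `concat_arrays`, and `repeat_array` are hypothetical names made up for this example; they are not part of Ibis and are not the approach taken in #2619.

```python
import numpy as np


def to_list(arr):
    """Hypothetical helper: normalize a list or np.ndarray to a plain list."""
    if isinstance(arr, np.ndarray):
        # tolist() also converts NumPy scalars to native Python scalars
        return arr.tolist()
    return list(arr)


def concat_arrays(left, right):
    """Concatenate two array values, whatever their representation."""
    return to_list(left) + to_list(right)


def repeat_array(arr, times):
    """Repeat an array value `times` times, whatever its representation."""
    return to_list(arr) * times


# Mixed inputs produce the same kind of output.
mixed_concat = concat_arrays([1, 2], np.array([3, 4]))
repeated = repeat_array(np.array([1, 2]), 2)
```

Normalizing at the boundary like this keeps each operation's logic independent of the input type, at the cost of a conversion per call.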

However, the Ibis API has several functions that can create array expressions for the user, and unfortunately, after #2615, the Pandas and Dask backends are no longer consistent in what kind of arrays they create: they may output either lists or np.arrays, depending on the part of the API.

| API | Operation | Creates | Represented by |
| --- | --- | --- | --- |
| ibis.literal([1, 2, 3]) | Literal | ArrayScalar | list |
| ibis.literal(np.array([1, 2, 3])) | Literal | ArrayScalar | np.array |
| ibis.array([1, 2, 3]) | Literal | ArrayScalar | list |
| ibis.array(np.array([1, 2, 3])) | Literal | ArrayScalar | list |
| ibis.array([t.int_col, t.other_int_col]) | ArrayColumn | ArrayColumn | np.arrays (elements) |
| arr_scalar[1:3] | ArraySlice | ArrayScalar | list |
| t.arr_col[1:3] | ArraySlice | ArrayColumn | list (elements) |
| arr_scalar + other_arr_scalar | ArrayConcat | ArrayScalar | list |
| t.arr_col + arr_scalar | ArrayConcat | ArrayColumn | 💥 (not working) |
| t.arr_col + t.other_arr_col | ArrayConcat | ArrayColumn | list (elements) |
| arr_scalar * 5 | ArrayRepeat | ArrayScalar | list |
| t.arr_col * 5 | ArrayRepeat | ArrayColumn | list (elements) |
| t.int_col.collect() | ArrayCollect | ArrayScalar | list |
| t.int_col.quantile([0.25, 0.50, 0.75]) | MultiQuantile | ArrayScalar | list |
| map_scalar.values() | MapValues | ArrayScalar | list |
| t.map_col.values() | MapValues | ArrayColumn | list (elements) |
| map_scalar.keys() | MapKeys | ArrayScalar | list |
| t.map_col.keys() | MapKeys | ArrayColumn | list (elements) |
| t.str_col.split(',') | StringSplit | ArrayColumn | list (elements) |
| elem_udf_list(t.int_col) | ElementWiseVectorizedUDF | ArrayColumn | list (elements) |
| elem_udf_ndarray(t.int_col) | ElementWiseVectorizedUDF | ArrayColumn | np.arrays (elements) |
| analytic_udf_list(t.int_col) | AnalyticVectorizedUDF | ArrayColumn | list (elements) |
| analytic_udf_ndarray(t.int_col) | AnalyticVectorizedUDF | ArrayColumn | np.arrays (elements) |
| reduc_udf_list(t.int_col) | ReductionVectorizedUDF | ArrayScalar | list |

We should make the return values for all of these operations consistent (otherwise it’s hard for the user to predict what the arrays in the final executed DataFrame will look like), but I wanted to get some feedback on whether we want the default representation to be list or np.array.
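To make the predictability concern concrete, the sketch below uses plain Python and NumPy (no Ibis involved) to show that the same operators behave differently on the two representations, so downstream code written against one can silently misbehave on the other. The variable names are illustrative only.

```python
import numpy as np

as_list = [1, 2, 3]
as_ndarray = np.array([1, 2, 3])

# `+` concatenates lists but adds np.arrays elementwise.
list_plus = as_list + as_list
ndarray_plus = as_ndarray + as_ndarray

# `*` repeats lists but multiplies np.arrays elementwise.
list_times = as_list * 2
ndarray_times = as_ndarray * 2

# `==` yields a single bool for lists but a boolean array for np.arrays.
list_eq = as_list == [1, 2, 3]
ndarray_eq = as_ndarray == np.array([1, 2, 3])
```

Note that the list behavior of `+` and `*` happens to match the ArrayConcat and ArrayRepeat semantics in the table above, while the np.array behavior does not.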

I think representing arrays as np.arrays would be useful, and I don’t see much reason to prefer lists over np.arrays. However, I think most other backends represent arrays using lists (in the Pandas DataFrames resulting from execution).

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
tswast commented, Feb 23, 2021

Related: https://github.com/ibis-project/ibis/issues/2377 has some thoughts on supporting lists of structs via Arrow.

0 reactions
cpcloud commented, Jan 11, 2022

@timothydijamco Thanks for the great issue write-up. It really warms my heart to get issues with descriptions like this. It makes maintainers' lives really easy.
