ENH: Support min/max on ArrowStringArray
Motivation
For Dask to perform large shuffles (set_index, join on a non-index column, …) on a column, it needs to be able to compute quantiles.
Computing the column's min/max values is a useful first step.
What actually breaks
When I try to do this on a column of type string[pyarrow], I get the following exception:
import pandas as pd
s = pd.Series(["a", "b", "c"]).astype("string[pyarrow]")
s.min()
~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
10825 )
10826 def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10827 return NDFrame.min(self, axis, skipna, level, numeric_only, **kwargs)
10828
10829 setattr(cls, "min", min)
~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
10348
10349 def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10350 return self._stat_function(
10351 "min", nanops.nanmin, axis, skipna, level, numeric_only, **kwargs
10352 )
~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in _stat_function(self, name, func, axis, skipna, level, numeric_only, **kwargs)
10343 name, axis=axis, level=level, skipna=skipna, numeric_only=numeric_only
10344 )
> 10345 return self._reduce(
10346 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
10347 )
~/miniconda/lib/python3.8/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
4380 if isinstance(delegate, ExtensionArray):
4381 # dispatch to ExtensionArray interface
-> 4382 return delegate._reduce(name, skipna=skipna, **kwds)
4383
4384 else:
~/miniconda/lib/python3.8/site-packages/pandas/core/arrays/string_arrow.py in _reduce(self, name, skipna, **kwargs)
377 def _reduce(self, name: str, skipna: bool = True, **kwargs):
378 if name in ["min", "max"]:
--> 379 return getattr(self, name)(skipna=skipna)
380
381 raise TypeError(f"Cannot perform reduction '{name}' with string dtype")
AttributeError: 'ArrowStringArray' object has no attribute 'min'
Solution
I am hopeful that Arrow already has a min/max implementation and that it just hasn't been hooked up yet.
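Until such a hookup exists, one possible stopgap is to fall back to Python objects for the reduction. The sketch below is only an illustration: `min_max_via_object` is a hypothetical helper, not part of pandas, and the conversion to object dtype copies the data, so it may be too slow or memory-hungry for large columns.

```python
import pandas as pd


def min_max_via_object(series: pd.Series):
    """Return (min, max) of a string Series by comparing plain Python strings.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Drop missing values first so pd.NA never takes part in a comparison.
    values = series.dropna().astype(object)
    if values.empty:
        return None, None
    return values.min(), values.max()


# The "string" dtype keeps this example independent of pyarrow; on an
# affected pandas version the same call would be made on a
# "string[pyarrow]" Series.
s = pd.Series(["b", "a", None, "c"], dtype="string")
print(min_max_via_object(s))
```

The helper returns `(None, None)` for an empty or all-missing column, which is an arbitrary choice made for this sketch.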
This works now on main and 1.5. I believe this was fixed by https://github.com/pandas-dev/pandas/pull/47730 when ArrowStringArray inherited from ArrowExtensionArray, which defined _reduce. Closing.

IMO, it's not worth rushing a short-term fix. I opened https://issues.apache.org/jira/browse/ARROW-13410. I have no sense for how difficult this would be to implement in Arrow, but if there's already support for sorting then perhaps min / max won't be too difficult.