ENH: Support min/max on ArrowStringArray

Motivation

For Dask to perform large shuffles (set_index, join on a non-index column, …) on a column, it needs to be able to compute quantiles of that column.

To do this it is useful to compute min/max values.

What actually breaks

When I try to do this on columns of type string[pyarrow], I get the following exception:

import pandas as pd
s = pd.Series(["a", "b", "c"]).astype("string[pyarrow]")
s.min()
~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
  10825         )
  10826         def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10827             return NDFrame.min(self, axis, skipna, level, numeric_only, **kwargs)
  10828 
  10829         setattr(cls, "min", min)

~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
  10348 
  10349     def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10350         return self._stat_function(
  10351             "min", nanops.nanmin, axis, skipna, level, numeric_only, **kwargs
  10352         )

~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in _stat_function(self, name, func, axis, skipna, level, numeric_only, **kwargs)
  10343                 name, axis=axis, level=level, skipna=skipna, numeric_only=numeric_only
  10344             )
> 10345         return self._reduce(
  10346             func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  10347         )

~/miniconda/lib/python3.8/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   4380         if isinstance(delegate, ExtensionArray):
   4381             # dispatch to ExtensionArray interface
-> 4382             return delegate._reduce(name, skipna=skipna, **kwds)
   4383 
   4384         else:

~/miniconda/lib/python3.8/site-packages/pandas/core/arrays/string_arrow.py in _reduce(self, name, skipna, **kwargs)
    377     def _reduce(self, name: str, skipna: bool = True, **kwargs):
    378         if name in ["min", "max"]:
--> 379             return getattr(self, name)(skipna=skipna)
    380 
    381         raise TypeError(f"Cannot perform reduction '{name}' with string dtype")

AttributeError: 'ArrowStringArray' object has no attribute 'min'
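
The last frame of the traceback shows why the call fails: `_reduce` looks up a method named after the reduction with `getattr`, and the array class has no `min` method. The sketch below reproduces that dispatch pattern with toy classes. `BaseArray` and `ArrayWithReductions` are hypothetical stand-ins invented for illustration, not pandas classes, and the `skipna` handling is only an assumption about how missing values might be dropped.

```python
class BaseArray:
    """Toy stand-in for an extension array (hypothetical, not the pandas class)."""

    def __init__(self, values):
        self.values = list(values)

    def _reduce(self, name, skipna=True):
        # Same shape as the _reduce in the traceback: dispatch by method name.
        if name in ["min", "max"]:
            return getattr(self, name)(skipna=skipna)
        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")


class ArrayWithReductions(BaseArray):
    """Same toy array, but with the min/max methods that _reduce expects."""

    def _clean(self, skipna):
        # Assumption for this sketch: missing values are represented as None.
        if skipna:
            return [v for v in self.values if v is not None]
        return self.values

    def min(self, skipna=True):
        return min(self._clean(skipna))

    def max(self, skipna=True):
        return max(self._clean(skipna))


# Without min/max defined, the getattr lookup raises AttributeError.
try:
    BaseArray(["a", "b", "c"])._reduce("min")
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)

# With the methods defined, the same dispatch succeeds.
smallest = ArrayWithReductions(["a", "b", None, "c"])._reduce("min")
largest = ArrayWithReductions(["a", "b", None, "c"])._reduce("max")
```

Once the class provides `min` and `max`, `_reduce` itself needs no change. This matches the shape of the report: the dispatch was already there, and only the methods were missing.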

Solution

I am hopeful that Arrow already has a min/max implementation and that it just hasn’t been hooked up yet.
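
On a pandas version where the reduction is missing, one possible stopgap is to catch the `AttributeError` and compare the values as plain Python strings. This is only a sketch. The helper name `min_max` is hypothetical, it assumes the column has no missing values, and the cast to object dtype gives up the memory benefits of the Arrow-backed type.

```python
import pandas as pd


def min_max(series: pd.Series):
    """Return (min, max), falling back to object dtype if the reduction is missing."""
    try:
        return series.min(), series.max()
    except AttributeError:
        # Affected versions: the Arrow-backed string array has no min/max,
        # so compare the values as ordinary Python strings instead.
        as_object = series.astype(object)
        return as_object.min(), as_object.max()


# Pass dtype="string[pyarrow]" here if pyarrow is installed.
s = pd.Series(["b", "a", "c"])
lo, hi = min_max(s)
```

On versions where `min` and `max` already work, the fallback branch is never taken and the helper costs nothing extra.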

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 19 (18 by maintainers)

Top GitHub Comments

mroeschke commented, Oct 13, 2022

This works now on main and 1.5. I believe this was fixed by https://github.com/pandas-dev/pandas/pull/47730, when ArrowStringArray began inheriting from ArrowExtensionArray, which defines _reduce. Closing.

TomAugspurger commented, Jul 20, 2021

IMO, it’s not worth rushing a short-term fix. I opened https://issues.apache.org/jira/browse/ARROW-13410. I have no sense for how difficult this would be to implement in Arrow, but if there’s already support for sorting then perhaps min / max won’t be too difficult.
