Using .str functions

I have tried, perhaps incorrectly, to convert my column to pyarrow string type as follows:

import fletcher as fr
import pyarrow as pa

fletcher_string_dtype = fr.FletcherDtype(pa.string())
df['string_col'] = df.string_col.astype(fletcher_string_dtype)

But now I can’t use string functions on the column. I get the error message: AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Specifically, I’m trying to use .str.contains().

I may be casting the column incorrectly. It may also be that there’s no value in using fletcher for this.

I saw in your talk that groupby was a nice use case. A related question: what are the best use cases for this dtype? Just a link to some additional reading material would be great.
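For reference, the error quoted in the question can be reproduced with plain pandas, without fletcher. The sketch below is only an illustration under that assumption: the sample data and the variable names are made up, and an integer column stands in for any dtype that pandas does not recognise as string-like.

```python
import pandas as pd

# A column of Python strings (object dtype in older pandas versions):
# the .str accessor is available and .str.contains() returns a boolean mask.
df = pd.DataFrame({"string_col": ["apple", "banana", "cherry"]})
mask = df["string_col"].str.contains("an")
print(mask.tolist())

# A column whose dtype pandas does not treat as string-like. Merely
# accessing .str raises AttributeError before any string method runs.
numbers = pd.Series([1, 2, 3])
error_message = ""
try:
    numbers.str
except AttributeError as exc:
    error_message = str(exc)
print(error_message)
```

The check happens when the accessor is looked up, which is why an extension dtype that pandas does not know to be string-like fails before .str.contains() is ever called.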

Top GitHub Comments

TomAugspurger commented, Apr 23, 2019

On Apr 23, 2019, at 16:07, Dave Hirschfeld notifications@github.com wrote:

Pandas doesn’t give fletcher any way to use .str. I don’t think we should since I’m interested in properly supporting strings in pandas sometime this year.

Unless you’re suggesting that the pandas default implementation will work directly with arrow data (in fletcher arrays)

That’s what I’m suggesting.

I’d disagree with this position - I don’t want to be forced to coerce my arrow data to pandas to do basic manipulations and I also don’t want the maintenance burden of 2 separate implementations.

I think pandas should make both .str and .dt available to be overridden by different (extension) dtypes with implementations that work / make sense / are performant for that data type.

The concept is similar to numpy’s __array_function__ protocol whereby different array implementations can override the default numpy implementation thereby allowing users to write generic code that works for numpy arrays, cupy arrays, sparse arrays, etc…

I’d like my transform functions to work seamlessly with either python/pandas strings or with arrow/fletcher strings. Of course, I don’t know if this may be an unreasonable hope given technical constraints but I think it’s something worth striving for with the benefits similar to that provided by numpy’s NEP-18.

dhirschfeld commented, Apr 23, 2019

Pandas doesn’t give fletcher any way to use .str. I don’t think we should since I’m interested in properly supporting strings in pandas sometime this year.

Unless you’re suggesting that the pandas default implementation will work directly with arrow data (in fletcher arrays) I’d disagree with this position - I don’t want to be forced to coerce my arrow data to pandas to do basic manipulations and I also don’t want the maintenance burden of 2 separate implementations.

I think pandas should make both .str and .dt available to be overridden by different (extension) dtypes with implementations that work / make sense / are performant for that data type.

The concept is similar to numpy’s __array_function__ protocol whereby different array implementations can override the default numpy implementation thereby allowing users to write generic code that works for numpy arrays, cupy arrays, sparse arrays, etc…
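As a rough illustration of the protocol mentioned here, the following sketch shows a container overriding np.sum through __array_function__. TaggedArray and total are hypothetical names invented for this example, and the override returns a tuple purely so the dispatch is visible. It assumes a NumPy version where the protocol is enabled by default (1.17 or later).

```python
import numpy as np


class TaggedArray:
    """Hypothetical array-like container, used only for illustration."""

    def __init__(self, values):
        self.values = list(values)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook instead of its own implementation when a
        # TaggedArray appears among the arguments of an overridable function.
        if func is np.sum:
            return ("TaggedArray.sum", sum(self.values))
        # Anything else is declared unsupported; NumPy then raises TypeError.
        return NotImplemented


def total(data):
    # Generic user code, written once against the NumPy API.
    return np.sum(data)


numpy_result = total(np.array([1, 2, 3]))      # default NumPy implementation
tagged_result = total(TaggedArray([1, 2, 3]))  # routed to the override
print(numpy_result, tagged_result)
```

The same total function serves both inputs, which is the property being asked for with .str and .dt.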

I’d like my transform functions to work seamlessly with either python/pandas strings or with arrow/fletcher strings. Of course, I don’t know if this may be an unreasonable hope given technical constraints but I think it’s something worth striving for with the benefits similar to that provided by numpy’s NEP-18.
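One way to picture a transform function that works with either kind of string column is type-based dispatch. The sketch below uses functools.singledispatch from the standard library. It is not how pandas or fletcher implement anything: ArrowLikeStrings, contains and keep_matching are hypothetical names, and the offsets-plus-buffer layout is only a loose imitation of how Arrow stores strings.

```python
from functools import singledispatch


class ArrowLikeStrings:
    """Hypothetical stand-in for an Arrow-backed string column.

    All values live in one contiguous buffer, delimited by offsets.
    """

    def __init__(self, values):
        self.buffer = "".join(values)
        self.offsets = [0]
        for value in values:
            self.offsets.append(self.offsets[-1] + len(value))


@singledispatch
def contains(column, pattern):
    # Fallback for container types with no registered implementation.
    raise TypeError(f"no 'contains' implementation for {type(column).__name__}")


@contains.register(list)
def _contains_list(column, pattern):
    # Default path: a plain list of Python strings.
    return [pattern in value for value in column]


@contains.register(ArrowLikeStrings)
def _contains_arrow_like(column, pattern):
    # Backend-specific path: search each value in place inside the shared
    # buffer, bounded by its offsets, without building Python string objects.
    bounds = zip(column.offsets[:-1], column.offsets[1:])
    return [column.buffer.find(pattern, start, end) != -1 for start, end in bounds]


def keep_matching(column, pattern):
    # Generic transform: identical code for both container types.
    return contains(column, pattern)


words = ["apple", "banana", "cherry"]
list_result = keep_matching(words, "an")
arrow_like_result = keep_matching(ArrowLikeStrings(words), "an")
print(list_result, arrow_like_result)
```

Each backend registers only its own implementation, so the calling code does not need to coerce one representation into the other.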