API: astype mechanism for extension arrays
In https://github.com/pandas-dev/pandas/pull/22343, we’re bumping into a few problems around casting to / from extension arrays.
- Our current default implementation for ExtensionArray.astype doesn’t handle target dtypes that are extension types (so extension_array.astype('category') fails). At the moment, each EA will have to implement their own astyping, which is error prone and difficult.
- Some EAs may be more opinionated than others about how astyping should be done. There may exist fastpaths between certain EAs, but only one of the source and destination types may know about that fast path. This (I think) calls for a kind of dispatch mechanism, where each gets to say whether they know how to handle the astyping.
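The dispatch idea in the second bullet can be sketched roughly as follows. This is a toy illustration in plain Python, not the pandas implementation: ToyArray and the _cast_from hook are hypothetical names invented for the example, while _cast_to and _from_sequence are borrowed from the discussion further down only as labels. The source array is asked first, then the destination type, and a generic fallback is used if neither knows a fast path.

```python
class ToyArray:
    """Hypothetical stand-in for an extension array (not the pandas API)."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def _from_sequence(cls, values):
        # Generic, slow-path construction from any sequence of scalars.
        return cls(values)

    def _cast_to(self, target_cls):
        # Source-side hook: return NotImplemented when no fast path is known.
        return NotImplemented

    @classmethod
    def _cast_from(cls, source):
        # Destination-side hook (hypothetical name): same convention.
        return NotImplemented

    def astype(self, target_cls):
        # 1. Ask the source whether it knows how to reach the target.
        result = self._cast_to(target_cls)
        # 2. Otherwise ask the destination whether it can convert the source.
        if result is NotImplemented:
            result = target_cls._cast_from(self)
        # 3. Otherwise fall back to the generic constructor.
        if result is NotImplemented:
            result = target_cls._from_sequence(self.values)
        return result


class IntArray(ToyArray):
    pass


class StrArray(ToyArray):
    @classmethod
    def _cast_from(cls, source):
        # Only the destination knows this conversion; the source does not.
        if isinstance(source, IntArray):
            return cls(str(v) for v in source.values)
        return NotImplemented


converted = IntArray([1, 2, 3]).astype(StrArray)
fallback = StrArray(["a", "b"]).astype(IntArray)
```

Here IntArray knows nothing about StrArray, yet the cast still takes StrArray's specialised path because the destination gets a say. The reverse cast falls through to the generic constructor, which in this toy simply copies the values without converting them.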
I’m not sure how far down this road we want to go. Should we instead encourage users to use explicit constructors like .assign(x=pd.Categorical(...)) rather than .astype('category')?
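For concreteness, the two spellings being compared look roughly like this on a small frame. The column name and data are made up for illustration, and the example assumes a reasonably recent pandas.

```python
import pandas as pd

# Illustrative data only; the column name "x" mirrors the snippet above.
df = pd.DataFrame({"x": ["a", "b", "a", "c"]})

# Explicit constructor: the caller builds the extension array directly.
explicit = df.assign(x=pd.Categorical(df["x"]))

# String alias: pandas has to work out how to cast to the target dtype.
via_astype = df.astype({"x": "category"})

same_result = explicit["x"].equals(via_astype["x"])
```

For a plain object or string column both routes end up with the same categorical column. The question in the issue is about what should happen when the source is itself an extension array.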
Issue Analytics
- State:
- Created 5 years ago
- Comments:41 (38 by maintainers)
Top GitHub Comments
I agree that in general casting (at least the safe default casting) should be about the “same information in different format”, but I think that’s not a hard line, and you could certainly argue that integers hold the same information (given you know the resolution, tz, etc.), although that’s certainly debatable 😃
But, on the specific datetime -> integer deprecation:
astype, but then point people to the view instead. There are use cases where you need the integers (e.g. if you want to do some custom rounding, or need to feed it to a system that requires unix time as integers, …), and personally I would rather have users go to astype than view (because astype is the more standard method for this, and if we would go with copy-on-write, this gets a bit of a strange method …).
In addition, using view will actually error for non-equal size bitwidth (astype actually as well, but that’s something we can change, while for view that is inherent to the method). And view can also silently overflow if converting to uint64, while for astype we could check for that. In general, I see view as an advanced method you should only use if you really know what you are doing (and in general you don’t really need in pandas, I think).
That’s my understanding of _cast_to, yes. My point was that this behavior already exists with the existing pattern with _from_sequence taking the place of _cast_to.
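The earlier point about view silently overflowing to uint64 can be illustrated with plain NumPy datetime64 arrays. This is only a sketch: it uses NumPy rather than pandas objects, and checked_to_uint64 is a hypothetical helper showing the kind of check an astype implementation could perform, not an existing pandas or NumPy function.

```python
import numpy as np

# One timestamp just before the Unix epoch, one just after it.
stamps = np.array(
    ["1969-12-31T23:59:59", "1970-01-01T00:00:01"], dtype="datetime64[ns]"
)

# astype to int64 gives nanoseconds since the epoch, negative before 1970.
as_int = stamps.astype("int64")

# view reinterprets the same 8 bytes; as uint64 the negative value wraps
# around to a huge positive number without any warning.
viewed_unsigned = stamps.view("uint64")


def checked_to_uint64(values):
    """Hypothetical checked cast: refuse values that do not fit in uint64."""
    ints = values.astype("int64")
    if (ints < 0).any():
        raise OverflowError("negative timestamps cannot be stored as uint64")
    return ints.astype("uint64")


try:
    checked_to_uint64(stamps)
    overflow_detected = False
except OverflowError:
    overflow_detected = True
```

Because view only reinterprets memory, it has no opportunity to notice the wraparound, whereas a cast routine can inspect the values first.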