BUG: pd.array([timedelta_like_strings]) should infer TimedeltaArray

>>> import pandas as pd
>>> left = pd.array(['59 days', '59 days', pd.NaT])
>>> left
<StringArray>
['59 days', '59 days', <NA>]
Length: 3, dtype: string

>>> left = pd.array(['59 days', '59 days', pd.NaT], dtype='m8[ns]')
>>> left
<TimedeltaArray>
['59 days', '59 days', NaT]
Length: 3, dtype: timedelta64[ns]
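
For readers who need a TimedeltaArray today regardless of what inference does, here is a minimal sketch of two ways to ask for one explicitly. The first repeats the explicit-dtype call from the report; the second parses with pd.to_timedelta and takes the backing array. The variable names are illustrative and not from the issue.

```python
import pandas as pd

values = ["59 days", "59 days", pd.NaT]

# Option 1: pass the dtype explicitly, as in the second snippet of the report.
explicit = pd.array(values, dtype="m8[ns]")

# Option 2: parse the strings first, then take the array backing the index.
parsed = pd.to_timedelta(values).array

# Print the types instead of assuming them; the resolution of the parsed
# result may differ between pandas versions.
print(type(explicit).__name__, explicit.dtype)
print(type(parsed).__name__, parsed.dtype)
```

Either route avoids relying on the default inference that this issue is about.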

Issue Analytics

State:
Created 3 years ago
Comments: 19 (19 by maintainers)

Top GitHub Comments

jbrockmendel commented, Oct 12, 2020

yah, i closed bc i thought there was consensus that the title of the issue was wrong; i.e. you guys convinced me.

jorisvandenbossche commented, Oct 13, 2020

Sorry for the confusion here 😉 (and ignore the title of the issue for a moment, that’s indeed in need of an update if we agree what this issue is about)

The only reason that I reopened this is because (I thought) our earlier discussion (@TomAugspurger see your comment above at https://github.com/pandas-dev/pandas/issues/33558#issuecomment-614096100 which says “Are we agreed that the expected dtype for these mixed cases (strings and NaT) is object?”) concluded that the current behaviour (inferring this mixed case to string) is not wanted (we want to infer mixed case to object instead). So since the example code that started this issue is not doing on master what we want it to do, it seems worth it to have an open issue about this (even though the original proposal of @jbrockmendel when opening this issue is different).

Ah, so Joris you reopened it to go the other way? To deprecate Series / Index’s behavior of inferring?

Well, I was not actually thinking about that, but now you say it … 😉 That’s maybe indeed what we should do.

There are basically two “different” behaviours to consider (and when reopening this issue I was actually only thinking about the first):

pd.array(['59 days', pd.NaT]) -> infers string dtype
pd.Series(['59 days', pd.NaT]) -> infers timedelta64[ns] dtype

Ideally, we want to have both of those consistent, agreed? And it seems there is also agreement on inferring as object dtype? (but noting again that raising an error is also still an option, users can specify dtype=object if they really meant to create an object-dtype series)

For the first one, I think we can still simply change, because pd.array()'s behaviour is still in flux anyway (string dtype is experimental). The Series behaviour, if we agree on changing this default inference, indeed is something we need to deprecate. The question might be if we already want to deprecate this now, or have this integrated in a general move to pd.array behaviour / nullable dtypes we need to do at some point (there are other differences in behaviour between pd.Series() inference and pd.array() inference as well)
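
To see which of the two behaviours an installed pandas version actually exhibits, a small check along these lines may help. It is only a sketch: the inferred dtypes are printed, not assumed, because they have changed across releases, and the variable names are illustrative. The last two lines show how to bypass inference entirely, either with dtype=object as mentioned above or by parsing with pd.to_timedelta first.

```python
import pandas as pd

values = ["59 days", pd.NaT]

# Inferred dtypes depend on the pandas version, so inspect them directly.
print("pd.array  ->", pd.array(values).dtype)
print("pd.Series ->", pd.Series(values).dtype)

# Explicit choices do not depend on the inference rules.
as_object = pd.Series(values, dtype=object)        # keep the raw objects
as_timedelta = pd.Series(pd.to_timedelta(values))  # parse the strings first
print(as_object.dtype, as_timedelta.dtype)
```

Code that states the dtype it wants is unaffected by whichever default the maintainers settle on.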
