BUG: datetime ExtensionDtype do not work with DataFrame

  • I have checked that this issue has not already been reported. (at least I couldn’t find one)

  • I have confirmed this bug exists on the latest version of pandas. (1.1.0)

  • (optional) I have confirmed this bug exists on the master branch of pandas. (934e9f840ebd2e8b5a5181b19a23e033bd3985a5)


Code Sample, a copy-pastable example

This is a high-level example that led to the investigation. It relies on rle-array (commit dfa79295a580d533ee9d2ea901e8808496dbcdc9 was used), because the pandas-provided DatetimeArray uses either a NumPy dtype or DatetimeTZDtype. Both of those cases somewhat work (see “Problem description”).

import pandas as pd
from rle_array import RLEArray

array = RLEArray._from_sequence([], dtype="datetime64[ns]")
df = pd.DataFrame({"x": array})

Traceback (most recent call last):
  File "bug.py", line 5, in <module>
    pd.DataFrame({"x": array})
  File ".../lib/python3.8/site-packages/pandas/core/frame.py", line 467, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File ".../lib/python3.8/site-packages/pandas/core/internals/construction.py", line 283, in init_dict
    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File ".../lib/python3.8/site-packages/pandas/core/internals/construction.py", line 93, in arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File ".../lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1650, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File ".../lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1703, in form_blocks
    block_type = get_block_type(v)
  File ".../lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 2672, in get_block_type
    assert not is_datetime64tz_dtype(values.dtype)
AssertionError

Problem description

See here:

https://github.com/pandas-dev/pandas/blob/934e9f840ebd2e8b5a5181b19a23e033bd3985a5/pandas/core/internals/blocks.py#L2647-L2690

Datetime (and also interval) types are checked BEFORE extension types, which means that extension datetime types never end up in an ExtensionBlock. Ending up in an ExtensionBlock would be useful if:

  • the datetime objects are not compatible with NumPy
  • the data should not be converted to NumPy (e.g. due to compression, like in the rle-array case)

Furthermore, the invariant issubclass(vtype, np.datetime64) => not is_datetime64tz_dtype(values.dtype) does NOT hold for all extension dtypes, at least not under the current implementation of is_datetime64tz_dtype:

https://github.com/pandas-dev/pandas/blob/934e9f840ebd2e8b5a5181b19a23e033bd3985a5/pandas/core/dtypes/common.py#L415-L421
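
The check order and the failing assertion can be illustrated with a small, dependency-free model. This is only a sketch and not the pandas implementation: Datetime64Scalar, the two dtype classes, looks_tz_aware and pick_block are hypothetical stand-ins, and the tz check is assumed to accept any extension dtype whose kind is "M".

```python
class Datetime64Scalar:
    """Hypothetical stand-in for np.datetime64 (keeps the sketch dependency-free)."""


class NumpyDatetimeDtype:
    """Models a plain NumPy datetime64[ns] dtype."""
    type = Datetime64Scalar
    kind = "M"
    is_extension = False


class ExternalDatetimeDtype:
    """Models a third-party extension dtype that holds datetime data."""
    type = Datetime64Scalar
    kind = "M"
    is_extension = True


def looks_tz_aware(dtype):
    # ASSUMPTION: models a tz check that accepts any extension dtype of kind "M".
    return dtype.is_extension and dtype.kind == "M"


def pick_block(dtype):
    # Datetime checks come first, the extension check comes last.
    if issubclass(dtype.type, Datetime64Scalar):
        if looks_tz_aware(dtype):
            # Models the invariant that is asserted in get_block_type.
            raise AssertionError("datetime64 scalar type but tz-aware dtype")
        return "DatetimeBlock"
    if looks_tz_aware(dtype):
        return "DatetimeTZBlock"
    if dtype.is_extension:
        return "ExtensionBlock"
    return "ObjectBlock"


print(pick_block(NumpyDatetimeDtype()))
try:
    pick_block(ExternalDatetimeDtype())
except AssertionError as exc:
    print("AssertionError:", exc)
```

In this model the external dtype never reaches the extension branch, because the datetime branch matches first and its invariant is violated.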

Expected Output

The code example works and df._data shows that the data ends up in an ExtensionBlock.
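
One way to check this could be a tiny helper that lists the block classes of a block manager. The helper name block_type_names is hypothetical, and it only assumes that the object behind df._data exposes a .blocks iterable, which is private pandas API and may change. The stub manager below exists only so that the sketch runs without pandas.

```python
from types import SimpleNamespace


def block_type_names(manager):
    """Return the class name of every block held by a block manager.

    Only assumes that `manager` has a `.blocks` iterable (private pandas API).
    """
    return [type(block).__name__ for block in manager.blocks]


class ExtensionBlock:
    """Hypothetical stub, NOT the pandas class; used for the demo only."""


# Stand-in for df._data so the sketch does not need pandas installed.
fake_manager = SimpleNamespace(blocks=(ExtensionBlock(),))
print(block_type_names(fake_manager))
```

With pandas installed, the call would be block_type_names(df._data), and the expected outcome described above is that ExtensionBlock shows up in the list.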

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : d9fff2792bf16178d4e450fe7384244e50635733
python           : 3.8.5.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Thu Jun 18 20:49:00 PDT 2020; root:xnu-6153.141.1~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.0
numpy            : 1.19.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.1.1
setuptools       : 47.1.0
Cython           : None
pytest           : 6.0.1
hypothesis       : None
sphinx           : 3.2.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.16.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : 0.50.1

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

sbrugman commented, Aug 17, 2020

marco-neumann-by commented, Aug 19, 2020

I had a look at the code in get_block_type: simply re-ordering the checks doesn’t work, because the pandas-provided datetime extension types would then end up in an ExtensionBlock, which would break a lot of things. So we have the following “conflict”:

  • DatetimeArray is implemented in a way that it relies on DatetimeBlock/DatetimeTZBlock but at the same time has an extension dtype
  • external extension arrays (even when they hold datetime data) probably want to end up in ExtensionBlock

So I think either DatetimeArray needs some changes, or get_block_type needs special handling specifically for DatetimeArray.
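
The special-handling option could look roughly like the toy dispatch below. It is a sketch under stated assumptions and not a patch for pandas: all class names, the is_extension and holds_datetime attributes, and pick_block are hypothetical stand-ins, and "DatetimeBlock" is used as a single label for both DatetimeBlock and DatetimeTZBlock.

```python
class PandasDatetimeArray:
    """Hypothetical stand-in for the pandas-provided DatetimeArray."""
    is_extension = True
    holds_datetime = True


class ExternalDatetimeArray:
    """Hypothetical stand-in for a third-party extension array with datetime data."""
    is_extension = True
    holds_datetime = True


class PlainDatetimeValues:
    """Hypothetical stand-in for a plain NumPy datetime64 array."""
    is_extension = False
    holds_datetime = True


def pick_block(values):
    # Special case first: the built-in array keeps its dedicated block type.
    if isinstance(values, PandasDatetimeArray):
        return "DatetimeBlock"
    # Every other extension array goes to ExtensionBlock, datetime or not.
    if values.is_extension:
        return "ExtensionBlock"
    # Non-extension datetime data keeps the existing behaviour.
    if values.holds_datetime:
        return "DatetimeBlock"
    return "ObjectBlock"


for values in (PandasDatetimeArray(), ExternalDatetimeArray(), PlainDatetimeValues()):
    print(type(values).__name__, "->", pick_block(values))
```

The point of this ordering is that only the built-in array is matched by type, so external arrays are no longer captured by the datetime branch.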
