BUG: TimeDeltaIndex slicing with strings incorrectly includes too much data
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import numpy as np
import pandas as pd
fs = 50000
i = pd.to_timedelta(np.arange(900 * fs, dtype=np.int64) / fs * 1e9, unit='ns')
df = pd.DataFrame({'dummy': np.arange(len(i))}, index=i)
assert np.isclose(len(df['710s':'711s']) / fs, 2.0)
assert np.isclose(len(df['710s':'719s']) / fs, 10.0)
assert np.isclose(len(df['610s':'620s']) / fs, 11.0)
assert np.isclose(len(df['710s':'720.00000001s']) / fs, 10.00002)
assert np.isclose(len(df['710s':'720s']) / fs, 11.0) # fails! Slices 70 seconds of data??
Problem description
When slicing a DataFrame that has a TimedeltaIndex, the right bound '720s' seems to be parsed incorrectly, so the slice does not cover the expected time range. As the example above shows, other bounds work as expected, but using '720s' as the right bound returns about 70 seconds of data, roughly 60 seconds more than it should.
Expected Output
Slicing between ‘710s’ and ‘720s’ should return 11 seconds of data, as slicing ‘610s’ and ‘620s’ does.
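Until the string parsing is sorted out, one possible workaround is to pass explicit pd.Timedelta bounds instead of strings, so that no resolution has to be inferred from the text. This is only a sketch: it uses a smaller sampling rate than the report to stay light, and the variable names are made up for illustration. Note that label slicing with .loc includes both endpoints, so this yields 10 seconds plus one sample rather than the 11 seconds that second-resolution string slicing gives.

```python
import numpy as np
import pandas as pd

fs = 1000  # assumption: reduced from 50000 in the report to keep the example small
n = 900 * fs

# Build the index from integer nanoseconds to avoid floating-point rounding.
idx = pd.to_timedelta(np.arange(n, dtype=np.int64) * (10**9 // fs), unit="ns")
df = pd.DataFrame({"dummy": np.arange(n)}, index=idx)

left = pd.Timedelta(seconds=710)
right = pd.Timedelta(seconds=720)

# Label-based slice: both endpoints are included.
closed = df.loc[left:right]

# Boolean mask: half-open interval [left, right), exactly 10 seconds of samples.
half_open = df[(df.index >= left) & (df.index < right)]

print(len(closed) / fs, len(half_open) / fs)
```

The boolean mask is the more explicit of the two if an exact duration matters, since the interval ends are spelled out in the code.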
Output of pd.show_versions()
pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 9.0.1
setuptools : 46.1.3
Cython : None
pytest : 4.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 0.999999999
pymysql : 0.9.3
psycopg2 : None
jinja2 : 2.10.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : 2.4.11
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 4.3.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.12
tables : None
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : None
numba : 0.48.0
Issue Analytics
- Created 3 years ago
- Reactions:1
- Comments:10 (5 by maintainers)
Top GitHub Comments
@mroeschke I think this is related to #21186. While I do see the point in having partial string indexing for datetimes, in the case of TimedeltaIndex it can create some strange and counterintuitive behaviour. Some examples to explain better:
This has to do with the resolution parsed from the timedelta string. Maybe for timedelta indices it would make more sense to always use the resolution of the index? Or provide an alternative implementation (e.g. FixedResolutionTimedeltaIndex) allowing for this use case?
I think what's happening here is that we are not actually getting the resolution of the string, just the Timedelta constructed from it. By contrast, with DatetimeIndex the parsing code also returns information about the string's specificity.
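To make the resolution point concrete, here is a toy model in plain Python. It is not pandas' implementation; it simply assumes that the right bound is widened to the end of the coarsest unit (day, hour, minute, second, ...) that divides the parsed value exactly. All function names are hypothetical. Under that assumption, 720 s is exactly 12 minutes, so it is treated as minute resolution and widened to just before 780 s, which reproduces the numbers in the report.

```python
# Toy model only: names and logic are illustrative, not pandas internals.
NS_PER_SECOND = 10**9

# Candidate resolutions in nanoseconds, coarsest first:
# day, hour, minute, second, millisecond, microsecond, nanosecond.
UNITS_NS = [86_400 * 10**9, 3_600 * 10**9, 60 * 10**9, 10**9, 10**6, 10**3, 1]


def inferred_unit_ns(value_ns):
    """Return the coarsest unit that divides value_ns exactly."""
    for unit in UNITS_NS:
        if value_ns % unit == 0:
            return unit
    return 1  # not reached: 1 ns divides every integer


def right_bound_ns(value_ns):
    """Last nanosecond of the period implied by the inferred unit."""
    return value_ns + inferred_unit_ns(value_ns) - 1


def seconds_sliced(left_s, right_ns, fs=50_000):
    """Seconds of data selected from samples spaced 1/fs seconds apart."""
    period_ns = NS_PER_SECOND // fs
    first = left_s * fs  # index of the sample sitting on the left bound
    last = right_bound_ns(right_ns) // period_ns  # last sample at or before the bound
    return (last - first + 1) / fs


print(seconds_sliced(710, 719 * NS_PER_SECOND))  # 10.0
print(seconds_sliced(610, 620 * NS_PER_SECOND))  # 11.0
print(seconds_sliced(710, 720 * NS_PER_SECOND))  # 70.0
```

If this model is roughly what happens, using the string's own specificity (as the DatetimeIndex parsing does) or the index resolution would avoid the jump from second to minute resolution at round values such as 720 s.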