BUG: TimedeltaIndex slicing with strings incorrectly includes too much data

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
fs = 50000
i = pd.to_timedelta(np.arange(900 * fs, dtype=np.int64) / fs * 1e9, unit='ns')  # np.int was removed in NumPy 1.24
df = pd.DataFrame({'dummy': np.arange(len(i))}, index=i)
assert np.isclose(len(df['710s':'711s']) / fs, 2.0)
assert np.isclose(len(df['710s':'719s']) / fs, 10.0)
assert np.isclose(len(df['610s':'620s']) / fs, 11.0)
assert np.isclose(len(df['710s':'720.00000001s']) / fs, 10.00002)
assert np.isclose(len(df['710s':'720s']) / fs, 11.0)  # fails! Slices 70 seconds of data??

Problem description

Slicing a DataFrame that has a TimedeltaIndex with the right bound ‘720s’ does not return the expected time slice; the bound appears to be parsed incorrectly. As the example above shows, other bounds work as expected, but ‘720s’ as the right bound returns about 70 seconds of data instead of the expected 11.

Expected Output

Slicing between ‘710s’ and ‘720s’ should return 11 seconds of data, as slicing ‘610s’ and ‘620s’ does.
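
Until string bounds behave this way, one possible workaround is to pass explicit `pd.Timedelta` bounds, which are treated as exact instants. The sketch below assumes a lower sample rate than the report (`fs = 100` rather than 50000) and builds the index with `pd.timedelta_range` purely to keep the example light; the names `one_ns`, `start`, `stop` and `window` are illustrative.

```python
import numpy as np
import pandas as pd

fs = 100  # samples per second (assumption: smaller than the report's 50000)
idx = pd.timedelta_range(start="0s", periods=900 * fs, freq="10ms")
df = pd.DataFrame({"dummy": np.arange(len(idx))}, index=idx)

one_ns = pd.Timedelta(1, unit="ns")
start = pd.Timedelta("710s")
# Last instant that still belongs to second 720, so the window covers
# the 11 whole seconds the report expects.
stop = pd.Timedelta("721s") - one_ns

window = df.loc[start:stop]
seconds = len(window) / fs
print(seconds)
```

Because both bounds are `Timedelta` objects, no resolution is inferred from a string, so the window length depends only on the bounds themselves.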

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.3.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.0.0-29-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.3
numpy            : 1.18.2
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 9.0.1
setuptools       : 46.1.3
Cython           : None
pytest           : 4.3.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 0.999999999
pymysql          : 0.9.3
psycopg2         : None
jinja2           : 2.10.3
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.2.1
numexpr          : None
odfpy            : None
openpyxl         : 2.4.11
pandas_gbq       : None
pyarrow          : 0.13.0
pytables         : None
pytest           : 4.3.1
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.12
tables           : None
tabulate         : 0.8.6
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
xlsxwriter       : None
numba            : 0.48.0

Issue Analytics

State:
Created 3 years ago
Reactions: 1
Comments: 10 (5 by maintainers)

Top GitHub Comments

3 reactions

mattbit commented, Apr 28, 2020

@mroeschke I think this is related to #21186. While I do see the point in having partial string indexing for datetimes, in the case of TimedeltaIndex it can create some strange and counterintuitive behaviour.

An example to illustrate:

import numpy as np
import pandas as pd

# Create a time series with a 10 Hz timedelta index (one sample every 0.1 s),
# i.e. the index contains ['00:00:00', '00:00:00.1', '00:00:00.2', …] and
# the series values represent the sample number
idx = pd.timedelta_range(0, '10s', freq='100ms')
ts = pd.Series(np.arange(len(idx)), index=idx)

# I want to get a specific sample, at '00:00:03'
ts.loc['3s']  # returns the value at '00:00:03' (i.e. sample 30)
assert ts.loc['3s'] == 30  # indeed

# Now I want to get all samples up to '00:00:03'
ts.loc[:'3s']  # this returns all values up to '00:00:03.9' (i.e. sample 39)
assert ts.loc[:'3s'].iloc[-1] == 30  # this fails, because the last element is not 30 but 39

ts.loc[:'3.000s']  # this again returns all values up to '00:00:03.9'
assert ts.loc[:'3.000s'].iloc[-1] == 30  # fails, again

ts.loc[:'3.001s']  # this instead returns all values up to '00:00:03'
assert ts.loc[:'3.001s'].iloc[-1] == 30  # success!

# The paradox: selecting until '3.000s' returns more than selecting until '3.001s' (!)
len(ts.loc[:'3.000s']) > len(ts.loc[:'3.001s'])  # True

# Using `pandas.Timedelta` objects solves the ambiguity
ts.loc[:pd.Timedelta('3s')]  # returns all values until '00:00:03'
ts.loc[:pd.Timedelta('3s')].iloc[-1] == 30  # True

This has to do with the resolution parsed from the timedelta string. Maybe for timedelta indices it would make more sense to always use the resolution of the index? Or provide an alternative implementation (e.g. FixedResolutionTimedeltaIndex) allowing for this use case?
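
As a rough illustration of what "always use the resolution of the index" could look like from user code, here is a minimal sketch. `loc_until` is a hypothetical helper, not part of pandas; it simply converts the string to a `pd.Timedelta` before slicing, so the bound is treated as an exact instant.

```python
import numpy as np
import pandas as pd


def loc_until(obj, stop):
    """Hypothetical helper: slice up to and including `stop`,
    treating `stop` as an exact instant rather than a partial string."""
    return obj.loc[: pd.Timedelta(stop)]


idx = pd.timedelta_range(start="0s", end="10s", freq="100ms")
ts = pd.Series(np.arange(len(idx)), index=idx)

a = loc_until(ts, "3s")
b = loc_until(ts, "3.000s")
c = loc_until(ts, "3.001s")
print(len(a), len(b), len(c))
```

With this helper, ‘3s’, ‘3.000s’ and ‘3.001s’ all end at sample 30 on this index, so the paradox from the example above does not arise.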

1 reaction

jbrockmendel commented, Apr 13, 2022

I think what's happening here is that we are not actually getting the resolution of the string, just the Timedelta constructed from it. By contrast, with DatetimeIndex the parsing code also returns information about the string’s specificity.
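
To make that concrete, the sketch below reconstructs an upper bound from the parsed `Timedelta` alone, widening it to the end of its finest nonzero unit. `implied_upper_bound` is a hypothetical helper written to match the behaviour reported in this thread; it is an assumption about the effect, not pandas' actual implementation, and it only considers non-negative timedeltas.

```python
import pandas as pd

# Units from finest to coarsest, paired with Timedelta.components fields.
_UNITS = [
    ("nanoseconds", "ns"),
    ("microseconds", "us"),
    ("milliseconds", "ms"),
    ("seconds", "s"),
    ("minutes", "min"),
    ("hours", "h"),
]


def implied_upper_bound(label):
    """Hypothetical: widen a parsed label to the end of its finest nonzero unit."""
    td = pd.Timedelta(label)
    one_ns = pd.Timedelta(1, unit="ns")
    for field, unit in _UNITS:
        if getattr(td.components, field) != 0:
            return td + pd.Timedelta(1, unit=unit) - one_ns
    # Nothing finer than days is set, so fall back to a whole day.
    return td + pd.Timedelta(1, unit="D") - one_ns


for label in ["711s", "720s", "3s", "3.000s", "3.001s"]:
    print(label, "->", implied_upper_bound(label))
```

Under this model ‘720s’ parses to exactly 12 minutes, so its finest nonzero unit is the minute and the bound stretches to just before 13 minutes, while ‘711s’ only stretches to just before 712 seconds. Likewise ‘3.000s’ is indistinguishable from ‘3s’ once parsed, which is why it selects more data than ‘3.001s’.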
