BUG: TimeDeltaIndex slicing with strings incorrectly includes too much data

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
fs = 50000
i = pd.to_timedelta(np.arange(900 * fs, dtype=np.int64) / fs * 1e9, unit='ns')  # np.int was removed in NumPy 1.24
df = pd.DataFrame({'dummy': np.arange(len(i))}, index=i)
assert np.isclose(len(df['710s':'711s']) / fs, 2.0)
assert np.isclose(len(df['710s':'719s']) / fs, 10.0)
assert np.isclose(len(df['610s':'620s']) / fs, 11.0)
assert np.isclose(len(df['710s':'720.00000001s']) / fs, 10.00002)
assert np.isclose(len(df['710s':'720s']) / fs, 11.0)  # fails! Slices 70 seconds of data??

Problem description

When slicing a DataFrame that has a TimedeltaIndex, the particular right bound ‘720s’ seems to be parsed incorrectly, so the slice does not return the expected time range. As the example above shows, other bounds work as expected, but using ‘720s’ as the right bound returns about 60 more seconds of data than it should.

Expected Output

Slicing between ‘710s’ and ‘720s’ should return 11 seconds of data, just as slicing between ‘610s’ and ‘620s’ does.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.0.0-29-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 9.0.1
setuptools : 46.1.3
Cython : None
pytest : 4.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 0.999999999
pymysql : 0.9.3
psycopg2 : None
jinja2 : 2.10.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : 2.4.11
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 4.3.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.12
tables : None
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : None
numba : 0.48.0

Issue Analytics

  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

mattbit commented, Apr 28, 2020 (3 reactions)

@mroeschke I think this is related to #21186. While I do see the point in having partial string indexing for datetimes, in the case of TimedeltaIndex it can create some strange and counterintuitive behaviour.

An example to illustrate:

import numpy as np
import pandas as pd

# Create a time series with a 10 Hz timedelta index (one sample every 0.1 s),
# i.e. the index contains ['00:00:00', '00:00:00.1', '00:00:00.2', …] and
# the series values are the sample numbers
idx = pd.timedelta_range(0, '10s', freq='100ms')
ts = pd.Series(np.arange(len(idx)), index=idx)

# I want to get a specific sample, at '00:00:03'
ts.loc['3s']  # returns the value at '00:00:03' (i.e. sample 30)
assert ts.loc['3s'] == 30  # indeed

# Now I want to get all samples up to and including '00:00:03'
ts.loc[:'3s']  # this returns all values until '00:00:03.9' (i.e. sample 39)
assert ts.loc[:'3s'].iloc[-1] == 30  # this fails, because the last element is not 30 but 39

ts.loc[:'3.000s']  # this again returns all values until '00:00:03.9'
assert ts.loc[:'3.000s'].iloc[-1] == 30  # fails, again

ts.loc[:'3.001s']  # this instead returns all values until '00:00:03'
assert ts.loc[:'3.001s'].iloc[-1] == 30  # success!

# The paradox: selecting until '3.000s' returns more than selecting until '3.001s' (!)
len(ts.loc[:'3.000s']) > len(ts.loc[:'3.001s'])  # True

# Using `pandas.Timedelta` objects solves the ambiguity
ts.loc[:pd.Timedelta('3s')]  # returns all values until '00:00:03'
ts.loc[:pd.Timedelta('3s')].iloc[-1] == 30  # True
This has to do with the resolution parsed from the timedelta string. Maybe for timedelta indices it would make more sense to always use the resolution of the index? Or provide an alternative implementation (e.g. FixedResolutionTimedeltaIndex) allowing for this use case?
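
Until something like that exists, the exact-bound behaviour can be approximated from user code by converting string bounds to `pd.Timedelta` before slicing. The sketch below is only an illustration of that idea: `loc_exact` is a hypothetical helper name, not a pandas API, and it assumes a sorted TimedeltaIndex. It treats both bounds as exact values, which is not the same thing as using the index resolution.

```python
import numpy as np
import pandas as pd


def loc_exact(obj, start=None, stop=None):
    """Hypothetical helper: label-slice with exact timedelta bounds.

    String bounds are converted to pd.Timedelta first, so no resolution is
    inferred from the string. Both bounds are inclusive, as with .loc.
    """
    start = None if start is None else pd.Timedelta(start)
    stop = None if stop is None else pd.Timedelta(stop)
    return obj.loc[start:stop]


# Same 10 Hz series as in the example above
idx = pd.timedelta_range("0s", "10s", freq="100ms")
ts = pd.Series(np.arange(len(idx)), index=idx)

upto_3s = loc_exact(ts, stop="3s")
print(upto_3s.iloc[-1], len(upto_3s))  # 30 31
```

With this helper, `'3s'` and `'3.000s'` select the same rows, because both strings are reduced to the same `Timedelta` value before the slice is taken.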

jbrockmendel commented, Apr 13, 2022 (1 reaction)

I think what's happening here is that we are not actually getting the resolution of the string, just the Timedelta constructed from it. By contrast, with DatetimeIndex the parsing code also returns information about the string’s specificity.
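
The effect described in this comment can be reproduced with a small pure-Python model. This is a simplified sketch, not pandas' actual code: it assumes the resolution is taken from the finest non-zero component of the parsed value (in integer nanoseconds) and that a right slice bound is extended to the end of that unit. The names `UNITS`, `inferred_resolution` and `right_bound_ns` are made up for the illustration.

```python
# Simplified model (assumption, not pandas internals): values are integer
# nanoseconds, and resolution is the finest unit with a non-zero component.
UNITS = [
    ("nanosecond", 1),
    ("microsecond", 1_000),
    ("millisecond", 1_000_000),
    ("second", 1_000_000_000),
    ("minute", 60 * 1_000_000_000),
    ("hour", 3_600 * 1_000_000_000),
    ("day", 86_400 * 1_000_000_000),
]


def inferred_resolution(total_ns):
    """Return (name, size_ns) of the finest unit whose component is non-zero."""
    for (name, size), (_, next_size) in zip(UNITS, UNITS[1:]):
        if (total_ns % next_size) // size != 0:
            return name, size
    return UNITS[-1]  # everything finer than a day is zero


def right_bound_ns(total_ns):
    """Inclusive right slice bound: the last nanosecond of the inferred unit."""
    _, size = inferred_resolution(total_ns)
    return total_ns + size - 1


SECOND = 1_000_000_000
print(inferred_resolution(620 * SECOND)[0])  # second  (620 s = 10 min 20 s)
print(inferred_resolution(720 * SECOND)[0])  # minute  (720 s = 12 min 0 s)
print(right_bound_ns(720 * SECOND))          # 779999999999, just under 780 s
```

Under this model, `'720s'` is indistinguishable from `'12min'`, so the right bound runs to the end of minute 12, which matches the roughly 70 seconds returned in the original report. It also reproduces the `'3s'` versus `'3.001s'` paradox: 3.001 s has a non-zero millisecond component, so its bound ends earlier than the bound for 3 s.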
