[BUG] TimeGapSplit mutates the data

TimeGapSplit mutates the pandas.DataFrame it operates on if the date column is not a datetime type. For example:

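from datetime import timedelta

import numpy as np
import pandas as pd
from pandas.api.types import (
    is_object_dtype,
    is_datetime64_any_dtype
)
from sklego.model_selection import TimeGapSplit

# Build 30 daily dates and store them as strings, not datetimes
date_range = pd.date_range(start='1/1/2018', end='1/30/2018')
dates = [date.strftime('%m-%d-%Y') for date in date_range]

df = (
    pd.DataFrame(
        data=np.random.randint(0, 30, size=(30, 4)),
        columns=list('ABCy')
    )
    .assign(
        date=dates
    )
)

# Before: the date column holds plain strings
assert is_object_dtype(df['date'])

cv = TimeGapSplit(
    df=df,
    date_col='date',
    train_duration=timedelta(days=3),
    valid_duration=timedelta(days=1),
)

# After: constructing the splitter has converted the column in the caller's df
assert is_datetime64_any_dtype(df['date'])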

Is this desirable behavior?

Possible remedies are:

  • Only accept the datetime type
  • Accept str type, but make a copy and leave the pandas.DataFrame as is.
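To make the two remedies concrete, here is a minimal sketch of each. The helper names require_datetime and with_datetime_copy and the date_format parameter are hypothetical, made up for illustration; they are not part of TimeGapSplit or scikit-lego, and the actual fix may look different.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def require_datetime(df, date_col):
    """Remedy 1 (hypothetical helper): only accept a datetime column."""
    if not is_datetime64_any_dtype(df[date_col]):
        raise TypeError(
            f"column {date_col!r} must be a datetime type, got {df[date_col].dtype}"
        )
    return df


def with_datetime_copy(df, date_col, date_format=None):
    """Remedy 2 (hypothetical helper): convert on a copy, leave the input as is."""
    converted = df.copy()  # deep copy by default, so the caller's frame is untouched
    converted[date_col] = pd.to_datetime(converted[date_col], format=date_format)
    return converted


df = pd.DataFrame({"date": ["01-01-2018", "01-02-2018", "01-13-2018"], "y": [1, 2, 3]})
safe = with_datetime_copy(df, "date", date_format="%m-%d-%Y")

assert is_datetime64_any_dtype(safe["date"])    # the copy was converted
assert not is_datetime64_any_dtype(df["date"])  # the original still holds strings
```

The first remedy pushes the conversion onto the caller, which is stricter but explicit. The second keeps string input working at the cost of one extra copy of the frame.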

Happy to hear your thoughts @kayhoogland @stephanecollot @koaning

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 11 (8 by maintainers)

Top GitHub Comments

koaning commented, Sep 19, 2019 (1 reaction)

A copy feels like the safer option for now. It allows for flexibility in the future.

stephanecollot commented, Oct 28, 2019 (0 reactions)

@koaning can you close?
