cutoff point calculation seems to be wrong in diagnostic notebook

Hi there, I think there is an issue in the cutoff point date calculation, or maybe the description in the documentation is not accurate. Let me elaborate using the example in the notebook section.

import pandas as pd
df = pd.read_csv('./data/example_wp_log_peyton_manning.csv')
df['ds'] = pd.to_datetime(df['ds'], infer_datetime_format=True)
df.describe(include='all')

This shows that ds ranges from 2007-12-10 00:00:00 to 2016-01-20 00:00:00 with 2905 days in total. You then run:

df_cv = cross_validation(m, '365 days', initial='1825 days', period='365 days')

which generates a cutoff point at 2013-01-20; because initial = period, that is fine.

However, I was expecting the cutoff point to be at 2012-12-08:

from datetime import timedelta
cutoff_expected = df.ds.min() + timedelta(days=1825)

So in my mind there is an excess of 43 days, and I can’t figure out where they come from:

import datetime
datetime.datetime(2013, 1, 20) - datetime.datetime(2012, 12, 8)  # timedelta(days=43)

Let me know where my logic is failing. Cheers!
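
One possible explanation for the 43-day gap, sketched with only the standard library. It assumes that cross_validation places cutoffs backwards from the end of the history: the last cutoff is the last date minus the horizon, and earlier cutoffs step back by period for as long as at least initial of training data remains. The helper name sketch_cutoffs is hypothetical and is not part of Prophet's API.

```python
from datetime import datetime, timedelta

def sketch_cutoffs(ds_min, ds_max, horizon, initial, period):
    """Hypothetical helper: place cutoffs backwards from the end of the history."""
    cutoffs = []
    cutoff = ds_max - horizon            # last cutoff leaves one full horizon after it
    while cutoff >= ds_min + initial:    # every fold needs at least `initial` of training data
        cutoffs.append(cutoff)
        cutoff -= period                 # step back one period at a time
    return sorted(cutoffs)

ds_min = datetime(2007, 12, 10)
ds_max = datetime(2016, 1, 20)
initial = timedelta(days=1825)

cutoffs = sketch_cutoffs(
    ds_min, ds_max,
    horizon=timedelta(days=365),
    initial=initial,
    period=timedelta(days=365),
)
earliest_allowed = ds_min + initial      # 2012-12-08, the date expected in the question
gap = cutoffs[0] - earliest_allowed      # distance from that date to the earliest cutoff
print(cutoffs[0].date(), earliest_allowed.date(), gap.days)
```

Under these assumptions the earliest cutoff is 2013-01-20, which is 43 days after 2012-12-08. In this reading, initial is a lower bound on the training window rather than the exact position of the first cutoff.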

Issue Analytics

State:
Created 3 years ago
Comments:7 (2 by maintainers)

Top GitHub Comments

robomotic commented, Nov 13, 2020

Hi again, this was a very productive conversation; thanks for taking the time to explain it. I feel much more confident now about how I am using those features! I am closing this, and hopefully it will be useful to other users when searching. Thanks!

bletham commented, Nov 12, 2020

These are good questions.

The fitted parameters are not copied into each CV model, just the model structure (basically the settings that can be specified in the Prophet() constructor, plus added seasonalities/regressors/holidays). So it would copy over the fact that I have added a monthly seasonality to the model, but would not copy over the actual parameters of that monthly seasonality (the Fourier coefficients); these would be fit from the data inside that CV fold. I hope that clarifies it. Put another way, none of the parameters that are fit in Stan during model fitting are copied; they are all re-fit in each CV fold.
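
A toy illustration of that distinction between structure and fitted parameters. ToyModel and copy_structure are hypothetical names invented for this sketch; they only mimic the idea and are not how Prophet is implemented.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Tuple

@dataclass
class ToyModel:
    """Hypothetical stand-in for a forecasting model; not Prophet's API."""
    seasonalities: Tuple[str, ...] = ()   # structure: chosen by the user
    level: Optional[float] = None         # parameter: learned from data

    def fit(self, y):
        self.level = mean(y)              # "fitting" here is just averaging
        return self

def copy_structure(model):
    """Copy the settings only; fitted parameters are deliberately left out."""
    return ToyModel(seasonalities=model.seasonalities)

full = ToyModel(seasonalities=("monthly",)).fit([10.0, 12.0, 14.0, 16.0])
fold = copy_structure(full)       # same structure, no parameters yet
fold_before_fit = fold.level      # still None at this point
fold.fit([10.0, 12.0])            # re-fit on the fold's own training data
print(full.level, fold_before_fit, fold.level)
```

The fold keeps the monthly seasonality setting but ends up with its own fitted value, learned only from the data available in that fold.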
It does not do this automatically. There is a bit of a tension here. On the one hand, if I’m fitting a model with yearly seasonality but have a CV fold where it was auto-disabled, that fold probably wouldn’t provide a useful estimate of the error of the model when it does have yearly seasonality enabled. On the other hand, as you note, if I leave yearly seasonality enabled but then try to fit it on a CV fold that is only a few months long, that also probably won’t provide a very useful error estimate for a model that actually has enough data to fit the yearly seasonality. As you know, we have taken cross validation in the second direction (keeping the seasonalities on if they are used in the final model), and to guard against this issue we just raise a warning if the CV fold has less data than the seasonality: https://github.com/facebook/prophet/blob/ad3832bb1957da1ba3efb4f6b0196977fcd13f06/python/fbprophet/diagnostics.py#L152-L159 . So if you are doing CV on a model that has yearly seasonality, it will print this warning whenever the CV settings produce folds with less than 1 year of data.
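
The kind of check described above can be sketched as follows. This is a minimal standalone version that assumes the comparison is between the initial training window and the longest seasonality period. The function check_initial_window and the warning text are hypothetical, not copied from the linked diagnostics.py.

```python
import warnings
from datetime import timedelta

def check_initial_window(initial, seasonality_periods_days):
    """Hypothetical helper: warn when the first fold is shorter than a seasonality."""
    longest = timedelta(days=max(seasonality_periods_days.values()))
    if initial < longest:
        warnings.warn(
            f"Longest seasonality spans {longest.days} days, which is more than "
            f"the initial window of {initial.days} days; consider increasing initial."
        )
        return False
    return True

# Assumed periods in days for a model with yearly and weekly seasonality.
seasonalities = {"yearly": 365.25, "weekly": 7}

ok_long = check_initial_window(timedelta(days=1825), seasonalities)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok_short = check_initial_window(timedelta(days=90), seasonalities)
```

With initial='1825 days', as in the question, the check passes silently; a 90-day initial window would trigger the warning.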
