cutoff point calculation seems to be wrong in diagnostic notebook
Hi there, I think there is an issue in the cutoff point date calculation, or maybe the description in the documentation is not accurate. Let me elaborate using the example in the notebook section.
import pandas as pd
df = pd.read_csv('./data/example_wp_log_peyton_manning.csv')
df['ds'] = pd.to_datetime(df['ds'], infer_datetime_format=True)
df.describe(include='all')
This shows that ds ranges from 2007-12-10 00:00:00 to 2016-01-20 00:00:00 with 2905 days in total. You then run:
df_cv = cross_validation(m, '365 days', initial='1825 days', period='365 days')
which generates one cutoff point at 2013-01-20; because initial = period, that is fine.
However, I was expecting the cutoff point to be at 2012-12-08:
from datetime import timedelta
cutoff_expected = df.ds.min() + timedelta(days=1825)
So in my mind there is an excess of 43 days, and I can't figure out where they come from:
import datetime
datetime.datetime(2013, 1, 20) - datetime.datetime(2012, 12, 8)
Let me know where my logic is failing. Cheers!
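One way to probe the 43-day gap is a minimal sketch that assumes cutoffs are placed by stepping backward from the end of the history (last date minus horizon, then back by period), with initial acting only as a lower bound on how far back they may go. The helper name `sketch_cutoffs` is hypothetical and is not Prophet's API; it only illustrates the arithmetic under that assumption, using the date range reported above.

```python
from datetime import datetime, timedelta


def sketch_cutoffs(ds_min, ds_max, horizon, initial, period):
    """Hypothetical helper (not Prophet's API): step backward from the end.

    Assumes the last cutoff is ds_max - horizon and that earlier cutoffs
    are spaced `period` apart, kept only while at least `initial` of
    history precedes them.
    """
    cutoffs = []
    cutoff = ds_max - horizon
    while cutoff >= ds_min + initial:
        cutoffs.append(cutoff)
        cutoff -= period
    return cutoffs[::-1]  # earliest cutoff first


ds_min = datetime(2007, 12, 10)
ds_max = datetime(2016, 1, 20)
initial = timedelta(days=1825)

cutoffs = sketch_cutoffs(
    ds_min,
    ds_max,
    horizon=timedelta(days=365),
    initial=initial,
    period=timedelta(days=365),
)
# Distance between the earliest cutoff and the "min + initial" date
gap = cutoffs[0] - (ds_min + initial)

print(cutoffs)
print(gap)
```

Under that assumption the earliest cutoff lands on 2013-01-20, and stepping back one more period would fall before 2012-12-08, so the 43 days are the remainder left over when the grid is anchored at the end of the data instead of the start.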
Issue Analytics
- Created 3 years ago
- Comments:7 (2 by maintainers)
Top GitHub Comments
Hi again, this was a very productive conversation, thanks for taking the time to explain it. I feel much more confident now about how I am using those features! Closing this down; hopefully it will be useful to other users when searching. Thanks!
These are good questions.
The fitted parameters are not copied into each CV model, just the model structure (basically the settings that can be specified in the Prophet() constructor, plus added seasonalities/regressors/holidays). So it would copy over the fact that I have added a monthly seasonality to the model, but would not copy over the actual parameters of that monthly seasonality (the Fourier coefficients); these would be fit from the data inside that CV fold. I hope that clarifies that. Put otherwise, all of the parameters that are fit in Stan during model fitting are not copied but are re-fit in each CV fold.

It does not do this automatically. There is a bit of a tension here. On the one hand, if I'm fitting a model with yearly seasonality but have a CV fold where it was auto-disabled, that fold probably wouldn't provide a useful estimate of the error of the model when it does have yearly seasonality enabled. But as you note, on the other hand, if I leave yearly seasonality enabled but then try to fit it on a CV fold that is only a few months long, that also probably won't provide a very useful error estimate for a model that actually has enough data to fit the yearly seasonality.

As you know, we have taken the cross validation in the second direction (keeping the seasonalities on if they are used in the final model), and to avoid this issue we just raise a warning if the CV fold has less data than the seasonality: https://github.com/facebook/prophet/blob/ad3832bb1957da1ba3efb4f6b0196977fcd13f06/python/fbprophet/diagnostics.py#L152-L159 So if you are doing CV on a model that has yearly seasonality, it will print out this warning if the CV settings are such that there are folds with < 1 year of data.
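The warning described above can be illustrated with a rough sketch. It assumes the smallest fold is the initial window and compares that window against the longest seasonality period in the model. The function name, the logger name, and the dictionary of periods are hypothetical stand-ins, not the code at the linked lines.

```python
import logging
from datetime import timedelta

logger = logging.getLogger("cv_sketch")  # hypothetical logger name


def warn_if_initial_too_short(initial, seasonality_periods_days):
    """Hypothetical check: warn when the initial training window is
    shorter than the longest seasonality period (given in days).

    Returns True if a warning was logged, False otherwise.
    """
    period_max = max(seasonality_periods_days.values(), default=0.0)
    if initial < timedelta(days=period_max):
        logger.warning(
            "Seasonality has period of %s days, which is larger than "
            "the initial window. Consider increasing initial.",
            period_max,
        )
        return True
    return False


# Illustrative periods only: yearly and weekly seasonality, in days
seasonalities = {"yearly": 365.25, "weekly": 7}

short = warn_if_initial_too_short(timedelta(days=180), seasonalities)
long_enough = warn_if_initial_too_short(timedelta(days=1825), seasonalities)
print(short, long_enough)
```

With a 180-day initial window the check fires for the yearly seasonality, while the 1825-day window used in the notebook example passes quietly.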