Question: How to make sure testing data is not used for prediction when using future covariates?

Hi, I want to make sure I understand everything correctly and that I’m not accidentally using my testing data to make predictions. I read the documentation but did not find an example that leaves no doubt.

I have the following weekly series:

entire_series: 2018-01-01 - 2020-12-31
training_series: 2018-01-01 - 2019-12-31
testing_series: 2020-01-01 - 2020-12-31

I split each of those series into a target series (timestamp and target value) and a covariates series (the covariates). The covariates are known in advance.

from darts.models import RNNModel

rnn_model = RNNModel(input_chunk_length=52, training_length=80, n_rnn_layers=2)

rnn_model.fit(series=target_series_train, future_covariates=covariates_series_train, epochs=100)

When I make predictions, I want to make sure I am not using the target values of my test set (i.e. all the predictions for 2020 are made on the first day of the year), while still using all the test-set covariates for 2020.

predictions = rnn_model.predict(n=52, future_covariates=covariates_series)

Any confirmation or clarification is highly appreciated.

Top GitHub Comments

dennisbader commented, Nov 25, 2021

Hi @pabuta88 and thanks for writing.

The predictions will start one time step after the end of your target_series_train.

You can check the start time and end time of a TimeSeries with series.start_time() and series.end_time().

So if your target_series_train ends in the last week of 2019, the prediction will start in the first week of 2020.
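To make the timing concrete, here is a minimal sketch using only the Python standard library (it does not call Darts). It assumes the weekly timestamps fall on Mondays starting at 2018-01-01, so the last training timestamp is 2019-12-30; `forecast_dates` is a hypothetical helper written for this illustration.

```python
from datetime import date, timedelta


def forecast_dates(train_end, n, step=timedelta(weeks=1)):
    """Timestamps of an n-step forecast starting one step after train_end."""
    return [train_end + k * step for k in range(1, n + 1)]


# Assumption: weekly Monday timestamps, so training ends on 2019-12-30.
train_end = date(2019, 12, 30)
dates = forecast_dates(train_end, n=52)

# Every forecast timestamp lies strictly after the training data.
print(dates[0], dates[-1])  # 2020-01-06 2020-12-28
```

If the real series uses a different weekday or frequency, the dates shift accordingly; series.end_time() gives the actual anchor.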

According to your input_chunk_length, the lookback window at prediction is 52 time steps and the forecast horizon is n=52.

Data used from target train series: the last 52 time steps
Data used from covariates: the values at the same 52 past time steps as the target lookback window, plus the next n=52 time steps.

So your future covariates at prediction time must cover these 52 + 52 = 104 time steps. Btw: you can use the entire covariates_series (without splitting it into train and test series) for both training and prediction. The slicing of relevant covariates is done internally by the models.
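The 52+52 requirement can be worked out by hand. The sketch below uses only the standard library and follows the description above; it is not a call into Darts. It assumes weekly Monday timestamps with the training target ending on 2019-12-30, and `required_covariate_span` is a hypothetical helper name.

```python
from datetime import date, timedelta

WEEK = timedelta(weeks=1)


def required_covariate_span(train_end, input_chunk_length, n, step=WEEK):
    """First and last timestamp the future covariates need to cover."""
    # Lookback window: the last input_chunk_length steps of the target.
    first = train_end - (input_chunk_length - 1) * step
    # Forecast horizon: the n steps following the end of the target.
    last = train_end + n * step
    return first, last


# Assumption: the training target ends on Monday 2019-12-30.
first, last = required_covariate_span(date(2019, 12, 30), input_chunk_length=52, n=52)
n_steps = (last - first) // WEEK + 1

print(first, last, n_steps)  # 2019-01-07 2020-12-28 104
```

A covariates series spanning 2018 through 2020 contains this whole range, which is why passing the unsplit series works for prediction.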

hrzn commented, Sep 11, 2022

@buddih09 Usually, when talking about RNNs or neural networks in general, a longer training length gives better results, but it can also lead to overfitting, leaving the model useless. The same can be said about underfitting if the training length isn’t long enough for the model to capture a meaningful relationship. In my personal projects my training length varies by use case, but I generally stick to around 60% training data and 40% test data; this lets models learn a relationship while still leaving a big portion to test on. I know this didn’t exactly answer your question, but I hope it helps.

I think there’s a misunderstanding here. The training_length here refers to a number of time steps of a time series, not to a train/test split. @buddih09 With RNNs, input_chunk_length is the number of time steps fed into the network before it emits forecasts of the target. Then training_length (which must be larger than input_chunk_length) determines the total number of steps the RNN module is trained on. This is a hyper-parameter that would need to be tuned in many cases.
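One way to picture why training_length is a tuning knob is to count how many contiguous windows of that length fit into the training series. This is a simplified sketch under stated assumptions: the training series is taken to have 105 weekly points (Mondays from 2018-01-01 to 2019-12-30), `count_windows` is a hypothetical helper, and the way Darts actually builds its training samples may differ in detail.

```python
def count_windows(series_length, training_length, input_chunk_length):
    """Number of contiguous windows of training_length steps in a series."""
    if training_length <= input_chunk_length:
        raise ValueError("training_length must be larger than input_chunk_length")
    return max(series_length - training_length + 1, 0)


# Assumption: 105 weekly training points, parameters from the question.
windows = count_windows(series_length=105, training_length=80, input_chunk_length=52)
print(windows)  # 26
```

With a short series, a larger training_length leaves fewer distinct windows to train on, which is one reason to tune the value.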