
[DOC] Understanding SlidingWindowSplitter better

See original GitHub issue

Describe the issue linked to the documentation

I am trying to understand the start_with_window parameter for this class, since its effect is not clear from the output. I created a small code snippet to understand it better (based on the forecasting notebook in the examples folder).

The behavior with start_with_window=True seems fine (this is the classic sliding-window behavior), as explained here.

However the behavior with start_with_window=False seems to be a little odd.

  1. Why does Fold 1 start with no samples in the training data?
  2. Why do the first few folds (Fold 1 to Fold 12) have fewer training samples than the test window length? This would not work well for classical techniques such as ARIMA, since the accuracy of forecasts beyond the length of the training data would be very poor due to the missing information. Should the training window not be required to be at least as long as the test window (ideally much longer)? A sketch of the presumed index logic follows this list.
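
To make the two behaviors concrete, here is a minimal pure-NumPy sketch of the window indexing as I understand it (an illustration only, not sktime's actual implementation; the function name sliding_splits is hypothetical). With start_with_window=False the splitter starts from an empty training window and grows it until it reaches window_length, after which it slides.

import numpy as np

def sliding_splits(n, window_length, fh, step_length, start_with_window):
    # Illustrative re-implementation of the presumed index logic.
    # "cutoff" is the number of observations available for training.
    fh = np.asarray(fh)
    first_cutoff = window_length if start_with_window else 0
    for cutoff in range(first_cutoff, n - fh.max() + 1, step_length):
        train = np.arange(max(0, cutoff - window_length), cutoff)
        test = (cutoff - 1) + fh  # horizon is relative to the last training index
        yield train, test

For example, list(sliding_splits(36, 18, np.arange(1, 13), 1, False)) yields 25 folds, the first with an empty training window, matching the output shown below.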

start_with_window Evaluation Code

Setup

import numpy as np
from sktime.datasets import load_airline
from sktime.forecasting.model_selection import (
    SlidingWindowSplitter,
    temporal_train_test_split,
)

# The original snippet starts from an existing y_train; the airline dataset
# from the forecasting tutorial is assumed here.
y_train, y_test = temporal_train_test_split(load_airline())
y_train = y_train[:36]  # keep only the first 36 observations
len(y_train)

36

window_length = 18     # how much previous history to use for training
fh = np.arange(1, 13)  # how far ahead to forecast (steps 1 to 12, i.e. one year)
step_length = 1        # how far to advance the sliding window at each fold

start_with_window=True

# Initial window for fitting the regressor: int(36 * 0.5) == 18 here
initial_window = int(len(y_train) * 0.5)

cv1 = SlidingWindowSplitter(
    initial_window=initial_window,
    window_length=window_length,
    fh=fh,
    step_length=step_length,
    start_with_window=True
)

Behavior

n_splits = cv1.get_n_splits(y_train)
print("-"*30)
print(f"Number of Folds: {n_splits}")
print("-"*30)

for i, (train, test) in enumerate(cv1.split(y_train)):
    print(f"\nFold:{i+1}")
    print(f"Train Indices: {train} \nTest Indices:  {test}")

Number of Folds: 7

Fold:1 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17] Test Indices: [18 19 20 21 22 23 24 25 26 27 28 29]

Fold:2 Train Indices: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18] Test Indices: [19 20 21 22 23 24 25 26 27 28 29 30]

Fold:3 Train Indices: [ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] Test Indices: [20 21 22 23 24 25 26 27 28 29 30 31]

Fold:4 Train Indices: [ 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] Test Indices: [21 22 23 24 25 26 27 28 29 30 31 32]

Fold:5 Train Indices: [ 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21] Test Indices: [22 23 24 25 26 27 28 29 30 31 32 33]

Fold:6 Train Indices: [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22] Test Indices: [23 24 25 26 27 28 29 30 31 32 33 34]

Fold:7 Train Indices: [ 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23] Test Indices: [24 25 26 27 28 29 30 31 32 33 34 35]
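
Assuming the splitter enumerates cutoffs as in the sketch above, the fold count follows directly from the parameters:

n_splits = (n - max(fh) - window_length) // step_length + 1
         = (36 - 12 - 18) // 1 + 1
         = 7

Note that initial_window = int(36 * 0.5) = 18 happens to equal window_length here, so the initial fold looks just like the others.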

start_with_window=False

# Initial window for fitting the regressor: int(36 * 0.5) == 18 here
initial_window = int(len(y_train) * 0.5)

cv2 = SlidingWindowSplitter(
    initial_window=initial_window,
    window_length=window_length,
    fh=fh,
    step_length=step_length,
    start_with_window=False
)

Behavior

n_splits = cv2.get_n_splits(y_train)
print("-"*30)
print(f"Number of Folds: {n_splits}")
print("-"*30)

for i, (train, test) in enumerate(cv2.split(y_train)):
    print(f"\nFold:{i+1}")
    print(f"Train Indices: {train} \nTest Indices:  {test}")

Number of Folds: 25

Fold:1 Train Indices: [] Test Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11]

Fold:2 Train Indices: [0] Test Indices: [ 1 2 3 4 5 6 7 8 9 10 11 12]

Fold:3 Train Indices: [0 1] Test Indices: [ 2 3 4 5 6 7 8 9 10 11 12 13]

Fold:4 Train Indices: [0 1 2] Test Indices: [ 3 4 5 6 7 8 9 10 11 12 13 14]

Fold:5 Train Indices: [0 1 2 3] Test Indices: [ 4 5 6 7 8 9 10 11 12 13 14 15]

Fold:6 Train Indices: [0 1 2 3 4] Test Indices: [ 5 6 7 8 9 10 11 12 13 14 15 16]

Fold:7 Train Indices: [0 1 2 3 4 5] Test Indices: [ 6 7 8 9 10 11 12 13 14 15 16 17]

Fold:8 Train Indices: [0 1 2 3 4 5 6] Test Indices: [ 7 8 9 10 11 12 13 14 15 16 17 18]

Fold:9 Train Indices: [0 1 2 3 4 5 6 7] Test Indices: [ 8 9 10 11 12 13 14 15 16 17 18 19]

Fold:10 Train Indices: [0 1 2 3 4 5 6 7 8] Test Indices: [ 9 10 11 12 13 14 15 16 17 18 19 20]

Fold:11 Train Indices: [0 1 2 3 4 5 6 7 8 9] Test Indices: [10 11 12 13 14 15 16 17 18 19 20 21]

Fold:12 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10] Test Indices: [11 12 13 14 15 16 17 18 19 20 21 22]

Fold:13 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11] Test Indices: [12 13 14 15 16 17 18 19 20 21 22 23]

Fold:14 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12] Test Indices: [13 14 15 16 17 18 19 20 21 22 23 24]

Fold:15 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13] Test Indices: [14 15 16 17 18 19 20 21 22 23 24 25]

Fold:16 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14] Test Indices: [15 16 17 18 19 20 21 22 23 24 25 26]

Fold:17 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] Test Indices: [16 17 18 19 20 21 22 23 24 25 26 27]

Fold:18 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16] Test Indices: [17 18 19 20 21 22 23 24 25 26 27 28]

Fold:19 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17] Test Indices: [18 19 20 21 22 23 24 25 26 27 28 29]

Fold:20 Train Indices: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18] Test Indices: [19 20 21 22 23 24 25 26 27 28 29 30]

Fold:21 Train Indices: [ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] Test Indices: [20 21 22 23 24 25 26 27 28 29 30 31]

Fold:22 Train Indices: [ 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] Test Indices: [21 22 23 24 25 26 27 28 29 30 31 32]

Fold:23 Train Indices: [ 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21] Test Indices: [22 23 24 25 26 27 28 29 30 31 32 33]

Fold:24 Train Indices: [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22] Test Indices: [23 24 25 26 27 28 29 30 31 32 33 34]

Fold:25 Train Indices: [ 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23] Test Indices: [24 25 26 27 28 29 30 31 32 33 34 35]
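
By the same arithmetic as above, the first cutoff is now 0 instead of window_length:

n_splits = (36 - 12 - 0) // 1 + 1 = 25

Folds 1 to 18 use a growing (expanding) training window; by Fold 19 the window has reached window_length = 18 and starts sliding, so Folds 19 to 25 reproduce Folds 1 to 7 of the start_with_window=True run exactly.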

Suggest a potential alternative/fix

For start_with_window=False,

  1. Can the training window be made at least as large as the test window in all folds?
  2. If not, can a warning be given to the user? (A sketch of such a guard follows this list.)
  3. Can the default for this argument be changed from False to True, since that behavior seems to work correctly? https://github.com/alan-turing-institute/sktime/blob/139b9291fb634cce367f714a6132212b0172e199/sktime/forecasting/model_selection/_split.py#L183
  4. I also think this is linked to https://github.com/alan-turing-institute/sktime/issues/477. Some way to visualize the splits would be really nice for the end user.
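
As an illustration of suggestions 1 and 2, a thin wrapper could warn about (and skip) folds whose training window is shorter than the test window. This is a hypothetical helper sketched around the splitter's split interface, not an existing sktime function:

import warnings

def split_with_min_train(cv, y, min_train=None):
    # Yield (train, test) folds from splitter cv, skipping folds whose
    # training window is shorter than min_train (default: the test length).
    for train, test in cv.split(y):
        required = len(test) if min_train is None else min_train
        if len(train) < required:
            warnings.warn(
                f"Skipping fold with {len(train)} training samples "
                f"for a test window of length {len(test)}."
            )
            continue
        yield train, test

Applied to cv2 above, this would keep only Folds 13 to 25, where the training window is at least as long as the 12-step test window.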

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
ngupta23 commented, Dec 2, 2020

@mloning, thanks for your patience in explaining these to me (and the other issues that I have opened).

“the window_length=False is needed for model evaluation in forecasting update_predict based on some test data, where the first is the training data seen in fit and the subsequent windows add more data step by step.”

Did you mean start_with_window=False instead of window_length=False?
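
If so, I think the use case being described is an evaluation loop along these lines (a hypothetical sketch; update_predict is a real forecaster method, but the exact call shown here is my guess):

# The forecaster has already been fitted on y_train, so the first window over
# y_test may be empty; each subsequent window adds newly observed test points
# step by step. cv2 is the start_with_window=False splitter from above.
y_pred = forecaster.update_predict(y_test, cv=cv2)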

My main concern stems from the example in the forecasting notebook.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Imports below are assumed from the rest of the tutorial (sktime API of the time)
from sktime.forecasting.compose import ReducedRegressionForecaster
from sktime.forecasting.model_selection import (
    ForecastingGridSearchCV,
    SlidingWindowSplitter,
)
from sktime.performance_metrics.forecasting import smape_loss
from sktime.utils.plotting import plot_series

# tuning the 'n_estimators' hyperparameter of RandomForestRegressor from scikit-learn
regressor_param_grid = {"n_estimators": [100, 200, 300]}
forecaster_param_grid = {"window_length": [5, 10, 15, 20, 25]}

# create a tunable regressor with GridSearchCV
regressor = GridSearchCV(RandomForestRegressor(), param_grid=regressor_param_grid)
forecaster = ReducedRegressionForecaster(regressor, window_length=15, strategy="recursive")

cv = SlidingWindowSplitter(initial_window=int(len(y_train) * 0.5))
gscv = ForecastingGridSearchCV(forecaster, cv=cv, param_grid=forecaster_param_grid)

gscv.fit(y_train)
y_pred = gscv.predict(fh)
plot_series(y_train, y_test, y_pred, labels=["y_train", "y_test", "y_pred"])
smape_loss(y_test, y_pred)

So the default value of start_with_window in SlidingWindowSplitter is False. After the fit on the initial window, would the RandomForestRegressor not expect the same number of features in X in each fold when doing temporal cross-validation? But with start_with_window=False, the training window is expanding (not of uniform size) for the first 19 folds in my original example (first post). I would think that would cause an issue.
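
For context on why a fixed window matters: as I understand the reduction strategy, each training window is tabularized into fixed-width feature rows before the regressor sees it. A rough sketch of that tabularization (my understanding only, not sktime's exact code; make_reduction_table is a hypothetical name):

import numpy as np

def make_reduction_table(y, window_length):
    # Each row holds window_length lagged values; the target is the next value.
    y = np.asarray(y)
    n_rows = len(y) - window_length
    if n_rows <= 0:
        # Too little data: no complete (window, target) pair can be formed.
        return np.empty((0, window_length)), y[:0]
    X = np.stack([y[i : i + window_length] for i in range(n_rows)])
    target = y[window_length:]
    return X, target

With window_length=15, any training window shorter than 16 observations yields an empty table, which is exactly the problem with the early folds above.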

Maybe I don’t understand the inner workings completely yet, so I would appreciate it if you could shed some more light on this.

Thanks again!

0 reactions
mloning commented, Apr 26, 2021

Read more comments on GitHub >

