[DOC] Understanding SlidingWindowSplitter better
Describe the issue linked to the documentation
I am trying to understand the `start_with_window` parameter for this class, since its effect is not clear from the output alone. I created a small snippet of code to understand it better (based on the forecasting notebook in the examples folder).
The behavior with `start_with_window=True` seems to be OK (this is the classic sliding-window behavior), as explained here.
However, the behavior with `start_with_window=False` seems to be a little odd.
- Why does Fold 1 start with no samples in the training data?
- Why are there fewer training samples in the first few folds (Fold 1 to Fold 12) than the length of the test window? This would not work well for classical techniques such as ARIMA, since the accuracy of forecasts beyond the length of the training data would be very poor due to the missing information. Should the training window not be required to be at least as long as the test window (ideally much longer)?
`start_with_window` Evaluation Code
Setup
import numpy as np
from sktime.forecasting.model_selection import SlidingWindowSplitter

# y_train as prepared in the forecasting notebook from the examples folder;
# keep only the first 36 observations
y_train = y_train[:36]
len(y_train)  # 36

window_length = 18     # how much of the previous history to use to train
fh = np.arange(1, 13)  # how much to forecast (from 1 to 12, i.e. one year)
step_length = 1        # how much to step the sliding window
start_with_window=True
# For training the regressor
initial_window = int(len(y_train) * 0.5)  # 18
cv1 = SlidingWindowSplitter(
    initial_window=initial_window,
    window_length=window_length,
    fh=fh,
    step_length=step_length,
    start_with_window=True,
)
Behavior
n_splits = cv1.get_n_splits(y_train)
print("-" * 30)
print(f"Number of Folds: {n_splits}")
print("-" * 30)
for i, (train, test) in enumerate(cv1.split(y_train)):
    print(f"\nFold:{i+1}")
    print(f"Train Indices: {train} \nTest Indices: {test}")
Number of Folds: 7
Fold:1 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17] Test Indices: [18 19 20 21 22 23 24 25 26 27 28 29]
Fold:2 Train Indices: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18] Test Indices: [19 20 21 22 23 24 25 26 27 28 29 30]
Fold:3 Train Indices: [ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] Test Indices: [20 21 22 23 24 25 26 27 28 29 30 31]
Fold:4 Train Indices: [ 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] Test Indices: [21 22 23 24 25 26 27 28 29 30 31 32]
Fold:5 Train Indices: [ 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21] Test Indices: [22 23 24 25 26 27 28 29 30 31 32 33]
Fold:6 Train Indices: [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22] Test Indices: [23 24 25 26 27 28 29 30 31 32 33 34]
Fold:7 Train Indices: [ 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23] Test Indices: [24 25 26 27 28 29 30 31 32 33 34 35]
start_with_window=False
# For training the regressor
initial_window = int(len(y_train) * 0.5)  # 18
cv2 = SlidingWindowSplitter(
    initial_window=initial_window,
    window_length=window_length,
    fh=fh,
    step_length=step_length,
    start_with_window=False,
)
Behavior
n_splits = cv2.get_n_splits(y_train)
print("-" * 30)
print(f"Number of Folds: {n_splits}")
print("-" * 30)
for i, (train, test) in enumerate(cv2.split(y_train)):
    print(f"\nFold:{i+1}")
    print(f"Train Indices: {train} \nTest Indices: {test}")
Number of Folds: 25
Fold:1 Train Indices: [] Test Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11]
Fold:2 Train Indices: [0] Test Indices: [ 1 2 3 4 5 6 7 8 9 10 11 12]
Fold:3 Train Indices: [0 1] Test Indices: [ 2 3 4 5 6 7 8 9 10 11 12 13]
Fold:4 Train Indices: [0 1 2] Test Indices: [ 3 4 5 6 7 8 9 10 11 12 13 14]
Fold:5 Train Indices: [0 1 2 3] Test Indices: [ 4 5 6 7 8 9 10 11 12 13 14 15]
Fold:6 Train Indices: [0 1 2 3 4] Test Indices: [ 5 6 7 8 9 10 11 12 13 14 15 16]
Fold:7 Train Indices: [0 1 2 3 4 5] Test Indices: [ 6 7 8 9 10 11 12 13 14 15 16 17]
Fold:8 Train Indices: [0 1 2 3 4 5 6] Test Indices: [ 7 8 9 10 11 12 13 14 15 16 17 18]
Fold:9 Train Indices: [0 1 2 3 4 5 6 7] Test Indices: [ 8 9 10 11 12 13 14 15 16 17 18 19]
Fold:10 Train Indices: [0 1 2 3 4 5 6 7 8] Test Indices: [ 9 10 11 12 13 14 15 16 17 18 19 20]
Fold:11 Train Indices: [0 1 2 3 4 5 6 7 8 9] Test Indices: [10 11 12 13 14 15 16 17 18 19 20 21]
Fold:12 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10] Test Indices: [11 12 13 14 15 16 17 18 19 20 21 22]
Fold:13 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11] Test Indices: [12 13 14 15 16 17 18 19 20 21 22 23]
Fold:14 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12] Test Indices: [13 14 15 16 17 18 19 20 21 22 23 24]
Fold:15 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13] Test Indices: [14 15 16 17 18 19 20 21 22 23 24 25]
Fold:16 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14] Test Indices: [15 16 17 18 19 20 21 22 23 24 25 26]
Fold:17 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] Test Indices: [16 17 18 19 20 21 22 23 24 25 26 27]
Fold:18 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16] Test Indices: [17 18 19 20 21 22 23 24 25 26 27 28]
Fold:19 Train Indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17] Test Indices: [18 19 20 21 22 23 24 25 26 27 28 29]
Fold:20 Train Indices: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18] Test Indices: [19 20 21 22 23 24 25 26 27 28 29 30]
Fold:21 Train Indices: [ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] Test Indices: [20 21 22 23 24 25 26 27 28 29 30 31]
Fold:22 Train Indices: [ 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] Test Indices: [21 22 23 24 25 26 27 28 29 30 31 32]
Fold:23 Train Indices: [ 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21] Test Indices: [22 23 24 25 26 27 28 29 30 31 32 33]
Fold:24 Train Indices: [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22] Test Indices: [23 24 25 26 27 28 29 30 31 32 33 34]
Fold:25 Train Indices: [ 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23] Test Indices: [24 25 26 27 28 29 30 31 32 33 34 35]
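To make the difference between the two settings easier to see at a glance, here is a small summary loop (my own addition, reusing the y_train, fh, cv1 and cv2 objects defined above) that prints the training-window length of every fold and counts how many folds have fewer training samples than the 12-step forecasting horizon:

# Summarize training-window sizes per fold for both splitter configurations
for name, cv in [("start_with_window=True", cv1), ("start_with_window=False", cv2)]:
    train_sizes = [len(train) for train, _ in cv.split(y_train)]
    n_too_short = sum(size < len(fh) for size in train_sizes)
    print(f"{name}: train sizes per fold: {train_sizes}")
    print(f"  folds with fewer training samples than the {len(fh)}-step horizon: {n_too_short}")

With start_with_window=True every fold has 18 training samples; with start_with_window=False the first 12 folds have fewer training samples than the forecasting horizon.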
Suggest a potential alternative/fix
For `start_with_window=False`:

- Can the training window be made larger than the test window in all folds? (A possible user-side workaround is sketched after this list.)
- If not, can a warning be given to the user?
- Can the default behavior of this argument be changed to `True` instead of `False`, as that seems to be working correctly? https://github.com/alan-turing-institute/sktime/blob/139b9291fb634cce367f714a6132212b0172e199/sktime/forecasting/model_selection/_split.py#L183
- I think this is linked to https://github.com/alan-turing-institute/sktime/issues/477. Some way to visualize this would be really nice for the end user.
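As a possible user-side workaround in the meantime (just a sketch of what I have in mind, not functionality that SlidingWindowSplitter currently provides; split_with_min_train is a hypothetical helper name), one could simply skip the folds whose training window is shorter than the forecasting horizon:

# Hypothetical helper, not part of sktime: drop folds whose training window
# is shorter than a given minimum size.
def split_with_min_train(cv, y, min_train_size):
    for train, test in cv.split(y):
        if len(train) >= min_train_size:
            yield train, test

# With the cv2 splitter from above only folds 13-25 survive, because the
# first 12 folds have fewer than 12 training samples.
for i, (train, test) in enumerate(split_with_min_train(cv2, y_train, min_train_size=len(fh))):
    print(f"Fold {i + 1}: {len(train)} train / {len(test)} test samples")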
Issue Analytics
- Created: 3 years ago
- Reactions: 1
- Comments: 5 (2 by maintainers)
@mloning, thanks for your patience while explaining these to me (and the other issues that I have opened).

Did you mean `start_with_window=False` instead of `window_length=False`? My main concern stems from the example in the forecasting notebook.

So the default value for `start_with_window` is `False` in `SlidingWindowSplitter`. After the fitting has been done on the initial window, would the RandomForestRegressor not expect to see the same number of features in X in each fold when doing temporal cross-validation? But with `start_with_window=False`, the training window is expanding (not of uniform size) for the first 19 folds in my original example (first post). I would think that would cause an issue. Maybe I don't understand the inner workings completely yet, so I would appreciate it if you could shed some more light on this.
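To explain what I mean, here is a plain-NumPy sketch of how I picture the reduction/tabularization step (this is only my own illustration of sliding-window tabularization in general, not the actual sktime implementation; tabularize is a made-up helper): each row of X holds window_length lag values, so a longer training fold only contributes more rows while the number of feature columns stays the same.

import numpy as np

def tabularize(y, window_length):
    # Illustrative sliding-window tabularization: each row of X holds
    # `window_length` consecutive lags, the target z is the next value.
    X = np.array([y[i : i + window_length] for i in range(len(y) - window_length)])
    z = y[window_length:]
    return X, z

y_long = np.arange(30.0)   # a longer training fold
y_short = np.arange(20.0)  # a shorter training fold
X_long, _ = tabularize(y_long, window_length=18)
X_short, _ = tabularize(y_short, window_length=18)
print(X_long.shape, X_short.shape)  # (12, 18) (2, 18) -> same number of feature columns

In this picture the regressor would see the same number of columns in every fold and only the number of rows changes, but I am not sure that is how it actually works.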
Thanks again!
I think this is clearer now with PR #739 and https://github.com/alan-turing-institute/sktime/blob/main/examples/window_splitters.ipynb