Unified Validation Scheme
In my opinion, this package still needs a unified way to evaluate DL models.
Background
As everyone knows, there are usually three different sets: training, validation, and testing. One trains on the training set, validates and tunes the model (i.e., the hyperparameters) on the validation set, and finally evaluates on the unseen test set.
As it is arguably the most popular BCI dataset, my examples refer to the BCI Competition IV 2a dataset. As a starting point, I want to discuss within-subject validation.
Regarding the BCI IV 2a dataset, the train-test split is quite obvious: session_T for training and session_E for testing.
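As an illustration, this split amounts to grouping a subject's trials by their session label. A plain-Python sketch (the dict-of-trials layout is made up for illustration and is not the braindecode data structure):

```python
# Toy trial records; the "session" key mimics the BCI IV 2a session labels.
trials = [
    {"session": "session_T", "y": "left"},
    {"session": "session_T", "y": "right"},
    {"session": "session_E", "y": "left"},
]

# Group trials by session: session_T -> train set, session_E -> test set.
splits = {}
for trial in trials:
    splits.setdefault(trial["session"], []).append(trial)

train_set = splits["session_T"]
test_set = splits["session_E"]
```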
The problems arise with the validation set.
Examples
1. MOABB: here the dataset is split into session_T and session_E; the classifier is trained on session_T and validated on session_E. There is no test set (the validation set is the test set). This leads to better results because the "test set" is used to tune the model (hyperparameters, early stopping, etc.). The final model therefore benefits from the test data during training, and the final "test" result is positively biased. Additionally, a 2-fold cross-validation is performed (the same training is repeated with the sessions interchanged).
2. braindecode example: same as 1., but without the 2-fold cross-validation.
3. Schirrmeister et al. 2017, Appendix: split the data into train (session_T) and test (session_E), then train in two phases. (a) The train set is split into a train and a validation set (probably a random split?). The model is trained on the train split and the hyperparameters are tuned via the validation set; the final training loss of the best model is saved. (b) The best model is retrained on train and validation set together (the complete session_T) until the training loss reaches the previously saved best training loss, to prevent overfitting. All results are then obtained on the unseen test set (session_E).
4. "Conventional" approach: same as 3., but without the second training phase.
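To make methods 3 and 4 concrete, here is a toy sketch of the two training phases. A scalar mean estimator trained by gradient descent stands in for a deep network; all names, the holdout split, and the crude early-stopping rule are invented for this sketch and are not the paper's exact procedure:

```python
import random

random.seed(0)
data_T = [random.gauss(1.0, 0.5) for _ in range(100)]   # stand-in for session_T
train, valid = data_T[:80], data_T[80:]                 # simple holdout split

def mse(w, xs):
    """Mean squared error of the constant predictor w."""
    return sum((x - w) ** 2 for x in xs) / len(xs)

def fit(w, xs, stop, lr=0.1, max_epochs=1000):
    """Gradient descent on mse(w, xs) until stop(w) returns True."""
    for _ in range(max_epochs):
        if stop(w):
            break
        grad = sum(2 * (w - x) for x in xs) / len(xs)
        w = w - lr * grad
    return w

# Phase (a): train on the train split, stop once the validation loss
# stops improving (a crude form of early stopping).
valid_history = []
def early_stop(w):
    valid_history.append(mse(w, valid))
    return len(valid_history) > 1 and valid_history[-1] >= valid_history[-2]

w_a = fit(0.0, train, early_stop)
target_loss = mse(w_a, train)   # remembered training loss of the best model

# Phase (b), method 3 only: retrain on the complete session_T until the
# training loss reaches the remembered value (or max_epochs is hit).
w_b = fit(0.0, data_T, lambda w: mse(w, data_T) <= target_loss)
# Method 4 would skip phase (b) and evaluate w_a directly on session_E.
```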
Opinion
In my opinion, either method 3 or method 4 should be used to obtain reasonable results. The braindecode/BCI community would really benefit from a unified implementation of one (or both) of these methods. As method 3 adds complexity on top of method 4, it would be interesting to know how big the performance boost of method 3 over method 4 is (@robintibor: is it worth it?).
What is your opinion on this topic?
Top GitHub Comments
Hi @martinwimpff, I agree! If you need any help getting started with a PR, please let me know!
Thanks for the valuable input! So to wrap this up (and to make it as simple & fast as possible):
There should be a clear separation between "normal" train_test and HP tuning. As @agramfort pointed out, early stopping/the number of epochs is not a big issue, so we should just use a fixed number of epochs to keep everything simple.
So for “normal” training/for the final evaluation:
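A minimal sketch of what that fixed-epoch procedure could look like. The MajorityClass toy model and the function names are invented for illustration; a real version would wrap a braindecode model:

```python
class MajorityClass:
    """Toy stand-in for a deep model: predicts the most frequent label."""
    def __init__(self):
        self.counts = {}

    def partial_fit(self, labelled):
        for _x, y in labelled:
            self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, x):
        return max(self.counts, key=self.counts.get)

    def score(self, labelled):
        return sum(self.predict(x) == y for x, y in labelled) / len(labelled)

def train_test(model_fn, session_T, session_E, n_epochs=100):
    """Train on all of session_T for a fixed number of epochs (no early
    stopping, no validation set), then report accuracy on session_E."""
    model = model_fn()
    for _ in range(n_epochs):
        model.partial_fit(session_T)
    return model.score(session_E)
```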
For HP tuning:
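One way the HP-tuning step could look: cross-validate on session_T only, never touching session_E. This is a hedged sketch with hypothetical names (`tune_hyperparameters`, the user-supplied `evaluate` scorer), not an existing braindecode API:

```python
def tune_hyperparameters(configs, session_T, evaluate, n_folds=5):
    """Pick the config with the best mean validation score across folds.
    `evaluate(config, train, valid)` is a user-supplied scoring function."""
    fold_size = len(session_T) // n_folds

    def cv_score(config):
        scores = []
        for k in range(n_folds):
            lo, hi = k * fold_size, (k + 1) * fold_size
            valid = session_T[lo:hi]                     # held-out fold
            train = session_T[:lo] + session_T[hi:]      # remaining folds
            scores.append(evaluate(config, train, valid))
        return sum(scores) / n_folds

    return max(configs, key=cv_score)
```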
The best HP configuration can then be evaluated by the train_test procedure above.
These are options 1. and 3. from above, but split into three separate procedures. @bruAristimunha: do you agree?