
Potential Leakage of Information Across Folds in Kfold.py


🐛 Bug

I believe information can leak across folds when certain parameters in the Kfold.py script are changed. Say the user chooses to save every checkpoint. After training on the first fold finishes, the second fold reuses the same checkpoint directory as the first. So if the second fold finishes training and the user loads the best checkpoint, they may actually load a checkpoint produced during training on the first fold.
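
For illustration, here is a minimal sketch of the failure mode, assuming a single ModelCheckpoint shared by all folds; the paths and the LitModel name in the comments are hypothetical:

from pytorch_lightning.callbacks import ModelCheckpoint

# One callback and one directory are shared by every fold, and its "best model"
# bookkeeping is never reset when the loop advances to the next fold.
checkpoint_cb = ModelCheckpoint(dirpath="checkpoints", monitor="val_loss", save_top_k=-1)

# After fold 1, the callback remembers fold-1 files, e.g.
#   checkpoint_cb.best_model_path == "checkpoints/epoch=4-step=500.ckpt"
# Fold 2 then writes into the same directory with the same callback, so
#   LitModel.load_from_checkpoint(checkpoint_cb.best_model_path)
# at the end of fold 2 may return weights that were trained on fold 1.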

To Reproduce

Run the script Kfold.py

Expected behavior

We expect the training processes of the individual folds to be independent of one another.

Environment

  • PyTorch Lightning Version: 1.6.0dev
  • PyTorch Version: 1.10.0+cu102
  • Python version: 3.7.11
  • OS: Linux
  • CUDA/cuDNN version: Using CPU
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source):
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

I think the solution may be to clear out the previous fold's checkpoints when starting a new fold. We would also need to reset the checkpoint callback's state (e.g. reset the minimum validation loss when advancing to the next fold).
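
A minimal sketch of such a reset, assuming the checkpoint callback's bookkeeping attributes (best_model_path, best_model_score, best_k_models, kth_best_model_path, last_model_path) can be cleared directly and that dirpath can be reassigned between folds; the helper name and directory layout are illustrative:

import os

from pytorch_lightning.callbacks import ModelCheckpoint


def reset_checkpoint_callback(callback: ModelCheckpoint, fold: int, base_dir: str = "checkpoints") -> None:
    # Give each fold its own directory so fold N never sees fold N-1's files.
    callback.dirpath = os.path.join(base_dir, f"fold_{fold}")
    # Wipe the "best model" bookkeeping so the next fold starts from scratch.
    callback.best_model_path = ""
    callback.best_model_score = None
    callback.best_k_models = {}
    callback.kth_best_model_path = ""
    callback.last_model_path = ""

The k-fold loop could call a helper like this when it advances to the next fold.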

cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj @carmocca @justusschock

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

AlexTo commented, Apr 4, 2022 (1 reaction)

To use one model checkpoint per fold, here is how I did it:

  • In the model, log the metrics under a different name for each fold, e.g. val_loss should become f"fold_{fold}-val_loss":
def validation_step(self, batch, batch_idx):
    ...
    # the k-fold loop from the example exposes the index of the fold being trained
    fold = self.trainer.fit_loop.current_fold
    self.log(f"fold_{fold}-val_loss", loss.item(), on_step=False, on_epoch=True)
    ...
  • Create one model checkpoint instance per fold, each monitoring that fold's validation loss:
model_checkpoints = [KFoldModelCheckpoint(
    filename="{" + f"fold_{f}-val_loss" + "}_{epoch}.pt",
    monitor=f"fold_{f}-val_loss",
    mode="min",
    every_n_epochs=1,
    save_top_k=3
) for f in range(num_folds)]
  • But note that the original ModelCheckpoint will throw an error: the checkpoint instance for fold 0 only monitors fold_0-val_loss, so during the other folds the metric fold_0-val_loss is not found. We can simply extend ModelCheckpoint to skip metrics that belong to other folds:
# Imports assumed for the PyTorch Lightning version used here (1.6.x);
# exact module paths may differ in other releases.
from typing import Dict

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.utilities.exceptions import MisconfigurationException
from pytorch_lightning.utilities.types import _METRIC
from pytorch_lightning.utilities.warnings import WarningCache

warning_cache = WarningCache()


class KFoldModelCheckpoint(ModelCheckpoint):
    def _save_topk_checkpoint(self, trainer: "pl.Trainer", monitor_candidates: Dict[str, _METRIC]) -> None:
        if self.save_top_k == 0:
            return
        # validate metric
        if self.monitor is not None:
            if self.monitor not in monitor_candidates:
                if "fold" in self.monitor:
                    # Fold-specific metric not logged during this fold: skip silently.
                    return
                m = (
                    f"`ModelCheckpoint(monitor={self.monitor!r})` could not find the monitored key in the returned"
                    f" metrics: {list(monitor_candidates)}."
                    f" HINT: Did you call `log({self.monitor!r}, value)` in the `LightningModule`?"
                )
                if trainer.fit_loop.epoch_loop.val_loop._has_run:
                    raise MisconfigurationException(m)
                warning_cache.warn(m)
            self._save_monitor_checkpoint(trainer, monitor_candidates)
        else:
            self._save_none_monitor_checkpoint(trainer, monitor_candidates)

Now model checkpointing for k-fold training works properly 😉
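
For completeness, a minimal sketch of wiring these per-fold callbacks into the trainer; the trainer arguments, model and datamodule are placeholders, and the k-fold loop itself comes from the Kfold.py example:

from pytorch_lightning import Trainer

# `model_checkpoints` is the list of per-fold KFoldModelCheckpoint instances built above.
trainer = Trainer(max_epochs=10, callbacks=model_checkpoints)
# The Kfold.py example then installs its k-fold loop on the trainer before calling
# trainer.fit(model, datamodule=datamodule); each fold now checkpoints against its own
# fold_{k}-val_loss metric, so checkpoints no longer mix across folds.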

stale[bot] commented, Jun 6, 2022 (0 reactions)

This issue has been automatically marked as stale because it hasn’t had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
