Save checkpoint under the lightning_logs/version_X/ directory

🐛 Bug

After training, the output file structure looks like this:

epoch=9_vl_val_loss=10.10.ckpt
lightning_logs/
├── version_0
│   ├── events.out.tfevents.1585053395.dltn.22357.0
│   └── meta_tags.csv

but the expected file structure is:

lightning_logs/
├── version_0
│   ├── events.out.tfevents.1585053395.dltn.22357.0
│   ├── meta_tags.csv
│   └── epoch=9_vl_val_loss=10.10.ckpt

To Reproduce

Steps to reproduce the behavior:

1. Use PyTorch 1.4 and PyTorch Lightning (PL) 0.7.1.
2. Run the following snippet, "checkpoint_demo.py".

Code sample

#!/usr/bin/env python
"""checkpoint_demo.py"""
from torch.utils import data
import torch
import torch.nn as nn
import torch.optim as optim
from pytorch_lightning import Trainer
from pytorch_lightning import LightningModule
from pytorch_lightning.callbacks import ModelCheckpoint


class ConstantDataset(data.Dataset):
    def __len__(self): return 6
    def __getitem__(self, idx):
        c = torch.tensor(7.0, dtype=torch.float)
        return c, c

class CheckpointDemo(LightningModule):
    def __init__(self):
        super(CheckpointDemo, self).__init__()
        self.linear = nn.Linear(1, 1)

    @staticmethod
    def createModelCheckpoint():
        return ModelCheckpoint(monitor='val_loss', mode='min',
                               filepath='./{epoch}_vl_{val_loss:.2f}',
                               # filepath='{epoch}_vl_{val_loss:.2f}',  # if just filename it raises exception
                               # "/workspace/oplatek/code/.../venv/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py",
                               #     os.makedirs(self.dirpath, exist_ok=True)
                               #   File "/workspace/bin/anaconda3/lib/python3.6/os.py", line 220, in makedirs
                               #     mkdir(name, mode)
                               # FileNotFoundError: [Errno 2] No such file or directory: ''
                               save_weights_only=False,
                               verbose=True)

    def forward(self, x):
        return self.linear(x)

    def train_dataloader(self):
        return data.DataLoader(ConstantDataset(), batch_size=1)

    def val_dataloader(self):
        return data.DataLoader(ConstantDataset(), batch_size=1)

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=1.0)

    def validation_epoch_end(self, outputs):
        val_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
        return {'val_loss': val_loss, 'log': {'val_loss': val_loss}}

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': torch.nn.functional.mse_loss(self.forward(x), y)}

    def validation_step(self, batch, batch_idx):
        return {'val_loss': torch.tensor(10 + (1 / (self.current_epoch + 1)))}


if __name__ == "__main__":
    model = CheckpointDemo()
    trainer = Trainer(max_epochs=10, checkpoint_callback=CheckpointDemo.createModelCheckpoint())
    trainer.fit(model)
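
The traceback quoted in the commented-out lines of the sample ends in os.makedirs(self.dirpath, exist_ok=True) failing on an empty string. A minimal sketch of why a bare filename could lead there, assuming the callback derives its directory by splitting filepath (the traceback suggests this, but it is not confirmed here against the PL 0.7.1 source). The helper names split_filepath and try_makedirs are hypothetical and use only the standard library:

```python
import os


def split_filepath(filepath):
    """Hypothetical helper: split a checkpoint path template into (dirpath, filename)."""
    return os.path.split(filepath)


def try_makedirs(dirpath):
    """Return None on success, or the exception class name if makedirs fails."""
    try:
        os.makedirs(dirpath, exist_ok=True)
    except OSError as err:
        return type(err).__name__
    return None


# Bare filename: there is no directory component, so dirpath is the empty string.
bare_dir, bare_name = split_filepath("{epoch}_vl_{val_loss:.2f}")

# Same template prefixed with "./": dirpath is the current directory.
dotted_dir, dotted_name = split_filepath("./{epoch}_vl_{val_loss:.2f}")

bare_result = try_makedirs(bare_dir)      # makedirs("") cannot create anything
dotted_result = try_makedirs(dotted_dir)  # "." already exists, exist_ok=True accepts it

print(repr(bare_dir), bare_result)
print(repr(dotted_dir), dotted_result)
```

Under that assumption, prefixing the template with "./" avoids the exception because the directory part is no longer empty, which matches the filepath used in the sample.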

Top GitHub Comments

oplatek commented, Mar 26, 2020

@TylerYep About the duplicate: you are right! It is a duplicate of https://github.com/PyTorchLightning/pytorch-lightning/issues/1207

TylerYep commented, Mar 26, 2020

Two questions about this bug:

1. If ModelCheckpoint always saves to lightning_logs, you would be unable to save a file to any other location. Would this be preferable? The current API allows you to specify any location, including the lightning_logs/version of your choice.
2. The commented line is an empty string because it is missing the f in the f-string f"<content>". This is why the file cannot save. Once I add the f, epoch is not a defined variable. Does this fix that particular error?

Possible Duplicate of #1207
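
The first question above notes that the checkpoint location can point into a lightning_logs version folder. A minimal sketch of building such a path template, assuming the version number is known before training starts. The helper checkpoint_template and the hard-coded version 0 are hypothetical, and passing the result as filepath to ModelCheckpoint has not been verified here against PL 0.7.1:

```python
import os


def checkpoint_template(log_dir, version, name_template):
    """Hypothetical helper: join the log directory, version folder, and filename template."""
    return os.path.join(log_dir, f"version_{version}", name_template)


# Assumption: the logger will write to version_0; a real run would need the actual version.
filepath = checkpoint_template("lightning_logs", 0, "{epoch}_vl_{val_loss:.2f}")
dirpath, name = os.path.split(filepath)

print(filepath)
print(dirpath)
```

The resulting string keeps the {epoch} and {val_loss:.2f} placeholders untouched, so it could be used in place of the './{epoch}_vl_{val_loss:.2f}' value from the sample.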
