Save checkpoint under the lightning_logs/version_X/ directory

🐛 Bug

After training finishes, the output file structure looks like this:

epoch=9_vl_val_loss=10.10.ckpt
lightning_logs/
├── version_0
│   ├── events.out.tfevents.1585053395.dltn.22357.0
│   └── meta_tags.csv

but the expected file structure looks like this:

lightning_logs/
├── version_0
│   ├── events.out.tfevents.1585053395.dltn.22357.0
│   ├── meta_tags.csv
│   └── epoch=9_vl_val_loss=10.10.ckpt

To Reproduce

Steps to reproduce the behavior:

  1. Use PyTorch 1.4 and PL 0.7.1
  2. Run the following snippet "checkpoint_demo.py"

Code sample

#!/usr/bin/env python
"""checkpoint_demo.py"""
from torch.utils import data
import torch
import torch.nn as nn
import torch.optim as optim
from pytorch_lightning import Trainer
from pytorch_lightning import LightningModule
from pytorch_lightning.callbacks import ModelCheckpoint


class ConstantDataset(data.Dataset):
    def __len__(self): return 6
    def __getitem__(self, idx):
        c = torch.tensor(7.0, dtype=torch.float)
        return c, c

class CheckpointDemo(LightningModule):
    def __init__(self):
        super(CheckpointDemo, self).__init__()
        self.linear = nn.Linear(1, 1)

    @staticmethod
    def createModelCheckpoint():
        return ModelCheckpoint(monitor='val_loss', mode='min',
                               filepath='./{epoch}_vl_{val_loss:.2f}',
                               # filepath='{epoch}_vl_{val_loss:.2f}',  # if just filename it raises exception
                               # "/workspace/oplatek/code/.../venv/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py",
                               #     os.makedirs(self.dirpath, exist_ok=True)
                               #   File "/workspace/bin/anaconda3/lib/python3.6/os.py", line 220, in makedirs
                               #     mkdir(name, mode)
                               # FileNotFoundError: [Errno 2] No such file or directory: ''
                               save_weights_only=False,
                               verbose=True)

    def forward(self, x):
        return self.linear(x)

    def train_dataloader(self):
        return data.DataLoader(ConstantDataset(), batch_size=1)

    def val_dataloader(self):
        return data.DataLoader(ConstantDataset(), batch_size=1)

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=1.0)

    def validation_epoch_end(self, outputs):
        val_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
        return {'val_loss': val_loss, 'log': {'val_loss': val_loss}}

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': torch.nn.functional.mse_loss(self.forward(x), y)}

    def validation_step(self, batch, batch_idx):
        return {'val_loss': torch.tensor(10 + (1 / (self.current_epoch + 1)))}


if __name__ == "__main__":
    model = CheckpointDemo()
    trainer = Trainer(max_epochs=10, checkpoint_callback=CheckpointDemo.createModelCheckpoint())
    trainer.fit(model)
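
The FileNotFoundError quoted in the comments of createModelCheckpoint can be reproduced with the standard library alone. The sketch below assumes, based only on that traceback, that the callback takes the directory part of filepath and passes it to os.makedirs. The helper name makedirs_error is hypothetical and exists only for illustration.

```python
import os

# The bare template from the commented-out filepath line.
bare_template = "{epoch}_vl_{val_loss:.2f}"


def makedirs_error(dirpath):
    """Return the name of the exception raised by os.makedirs, or None on success."""
    try:
        os.makedirs(dirpath, exist_ok=True)
    except OSError as exc:
        return type(exc).__name__
    return None


# A bare filename has no directory component, so dirname() is an empty string.
bare_dir = os.path.dirname(bare_template)
print(repr(bare_dir))

# os.makedirs('') fails even with exist_ok=True, matching the quoted traceback.
print(makedirs_error(bare_dir))

# Prefixing "./" gives a non-empty directory part, which is why the
# uncommented filepath in the code sample does not raise.
print(repr(os.path.dirname("./" + bare_template)))
```

If this reading is right, the exception comes from the missing directory component of the path and not from the format placeholders.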

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
oplatek commented, Mar 26, 2020

@TylerYep About the duplicate: you are right! This is a duplicate of https://github.com/PyTorchLightning/pytorch-lightning/issues/1207

1 reaction
TylerYep commented, Mar 26, 2020

Two questions about this bug:

  1. If ModelCheckpoint always saved to lightning_logs, you would be unable to save a checkpoint to any other location. Would this be preferable? The current API lets you specify any location, including the lightning_logs/version_X directory of your choice.

  2. The commented line is an empty string because it is missing the f in the f-string f"<content>". This is why the file cannot save. Once I add the f, epoch is not a defined variable. Does this fix that particular error?

Possible Duplicate of #1207
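
As a workaround along the lines of question 1, the checkpoint template can be pointed at a version directory by hand. This is a minimal sketch, not Lightning's own behaviour: the helper versioned_checkpoint_template is hypothetical, and it assumes you already know which version_X directory the logger will use for the run.

```python
import os


def versioned_checkpoint_template(version,
                                  save_dir="lightning_logs",
                                  template="{epoch}_vl_{val_loss:.2f}"):
    """Build a checkpoint path template inside save_dir/version_<version>/.

    Hypothetical helper: the version number is an assumption supplied by the
    caller, it is not read from the logger.
    """
    return os.path.join(save_dir, f"version_{version}", template)


filepath = versioned_checkpoint_template(0)
print(filepath)

# The result would replace the './{epoch}_vl_{val_loss:.2f}' argument in the
# code sample, for example:
# ModelCheckpoint(monitor='val_loss', mode='min', filepath=filepath)
```

The placeholders are left unformatted on purpose, since the callback fills them in when it writes the file.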
