Save checkpoint under the lightning_logs/version_X/ directory
🐛 Bug
After running training, the output file structure looks like:
epoch=9_vl_val_loss=10.10.ckpt
lightning_logs/
├── version_0
│   ├── events.out.tfevents.1585053395.dltn.22357.0
│   ├── meta_tags.csv
but the expected file structure looks like:
lightning_logs/
├── version_0
│   ├── events.out.tfevents.1585053395.dltn.22357.0
│   ├── meta_tags.csv
│   ├── epoch=9_vl_val_loss=10.10.ckpt
To Reproduce
Steps to reproduce the behavior:
- Use PyTorch 1.4 and PL 0.7.1
- Run the following snippet "checkpoint_demo.py"
Code sample
#!/usr/bin/env python
"""checkpoint_demo.py"""
from torch.utils import data
import torch
import torch.nn as nn
import torch.optim as optim
from pytorch_lightning import Trainer
from pytorch_lightning import LightningModule
from pytorch_lightning.callbacks import ModelCheckpoint


class ConstantDataset(data.Dataset):
    """Six identical samples: input 7.0, target 7.0."""

    def __len__(self):
        return 6

    def __getitem__(self, idx):
        c = torch.tensor(7.0, dtype=torch.float)
        return c, c


class CheckpointDemo(LightningModule):
    def __init__(self):
        super(CheckpointDemo, self).__init__()
        self.linear = nn.Linear(1, 1)

    @staticmethod
    def createModelCheckpoint():
        return ModelCheckpoint(monitor='val_loss', mode='min',
                               filepath='./{epoch}_vl_{val_loss:.2f}',
                               # filepath='{epoch}_vl_{val_loss:.2f}',  # if just filename it raises exception
                               # "/workspace/oplatek/code/.../venv/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py",
                               #     os.makedirs(self.dirpath, exist_ok=True)
                               #   File "/workspace/bin/anaconda3/lib/python3.6/os.py", line 220, in makedirs
                               #     mkdir(name, mode)
                               # FileNotFoundError: [Errno 2] No such file or directory: ''
                               save_weights_only=False,
                               verbose=True)

    def forward(self, x):
        return self.linear(x)

    def train_dataloader(self):
        return data.DataLoader(ConstantDataset(), batch_size=1)

    def val_dataloader(self):
        return data.DataLoader(ConstantDataset(), batch_size=1)

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=1.0)

    def validation_epoch_end(self, outputs):
        val_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
        return {'val_loss': val_loss, 'log': {'val_loss': val_loss}}

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': torch.nn.functional.mse_loss(self.forward(x), y)}

    def validation_step(self, batch, batch_idx):
        # Synthetic validation loss that decreases towards 10 with each epoch.
        return {'val_loss': torch.tensor(10 + (1 / (self.current_epoch + 1)))}


if __name__ == "__main__":
    model = CheckpointDemo()
    trainer = Trainer(max_epochs=10, checkpoint_callback=CheckpointDemo.createModelCheckpoint())
    trainer.fit(model)
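The traceback quoted in the comments above ends in os.makedirs being called with an empty string. A minimal standard-library sketch of how that can happen, assuming (this is not confirmed by the traceback alone) that the callback derives its directory by splitting filepath with os.path.split; try_makedirs is a hypothetical helper used only for this illustration:

```python
import os

# The two filepath variants from the snippet above.
bare = "{epoch}_vl_{val_loss:.2f}"
prefixed = "./{epoch}_vl_{val_loss:.2f}"

# Assumption: the directory part is obtained with os.path.split.
dirpath_bare, _ = os.path.split(bare)          # no directory component
dirpath_prefixed, _ = os.path.split(prefixed)  # current directory


def try_makedirs(path):
    """Hypothetical helper: return the exception raised by os.makedirs, or None."""
    try:
        os.makedirs(path, exist_ok=True)
        return None
    except FileNotFoundError as exc:
        return exc


error_bare = try_makedirs(dirpath_bare)
error_prefixed = try_makedirs(dirpath_prefixed)
print(repr(dirpath_bare), type(error_bare).__name__)
print(repr(dirpath_prefixed), error_prefixed)
```

Under that assumption, the bare filename yields an empty directory path, which os.makedirs rejects, while the "./" prefix yields a directory that already exists.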
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (5 by maintainers)
@TylerYep Regarding the duplicate: you are right! This is a duplicate of https://github.com/PyTorchLightning/pytorch-lightning/issues/1207
Two questions about this bug:
- If ModelCheckpoint always saved to lightning_logs, you would be unable to save a file to any other location. Would this be preferable? The current API allows you to specify any location, including the lightning_logs/version_X directory of your choice.
- The commented line is an empty string because it is missing the f in the f-string f"<content>". This is why the file cannot save. Once I add the f, epoch is not a defined variable. Does this fix that particular error?
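The option mentioned in the first question, pointing the checkpoint at a chosen version directory, could be sketched roughly as follows. The helper checkpoint_template is hypothetical and not part of PyTorch Lightning, and the version number has to be known in advance, which is an assumption of this sketch. The template stays a plain string with literal {epoch} and {val_loss} placeholders rather than an f-string:

```python
import os


def checkpoint_template(log_root, version, name="{epoch}_vl_{val_loss:.2f}"):
    """Hypothetical helper: build a checkpoint path template inside a version directory."""
    return os.path.join(log_root, "version_{}".format(version), name)


# Assumption: version 0 is the run we want the checkpoint to sit next to.
filepath = checkpoint_template("lightning_logs", 0)
print(filepath)

# With PL 0.7.x this string would then be passed as
# ModelCheckpoint(filepath=filepath, monitor='val_loss', mode='min', ...).
```

Because the directory part of this template is non-empty, it also avoids the empty-dirpath case shown in the traceback.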
Possible Duplicate of #1207