Strange behavior of CheckpointSaver with key_metric_n_saved
Describe the bug
When the key_metric_n_saved argument is set, CheckpointSaver crashes.
I noticed that with key_metric_n_saved=4, the number of saved models is sometimes less than 4.
Some models are unexpectedly removed, which may have caused the crash.
I suspect this may be an ignite-related issue.
To Reproduce
My validation handlers look like this:
val_handlers = [
    StatsHandler(output_transform=lambda x: None),
    TensorBoardStatsHandler(summary_writer=writer, tag_name="val_acc"),
    CheckpointSaver(save_dir=model_dir, save_dict={"net": net}, save_key_metric=True, key_metric_n_saved=4),
    MyTensorBoardImageHandler(
        summary_writer=writer,
        batch_transform=lambda x: (None, None),
        output_transform=lambda x: x["image"],
        prefix_name="Val",
    ),
]
Environment (please complete the following information):
- OS: Ubuntu 18.04
- Python version: 3.7
- MONAI version [e.g. git commit hash]: 1d2dce719e8adae2fba2df7b58dfb24ca4531c3a
- CUDA/cuDNN version: 10.2
- Ignite version: 0.3.0
Additional context
Error log:
ERROR:ignite.engine.engine.SupervisedTrainer:Exception: [Errno 2] No such file or directory: '/homes/Data/exp/test/Models/Best/net_key_metric=0.5.pth'
Traceback (most recent call last):
File "/homes/.miniconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 942, in _internal_run
self._fire_event(Events.EPOCH_COMPLETED)
File "/homes/.miniconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 607, in _fire_event
func(self, *(event_args + args), **kwargs)
File "/homes/.miniconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 443, in wrapper
return handler(engine, *args, **kwargs)
File "/homes/Code/MONAI/monai/handlers/validation_handler.py", line 64, in __call__
self.validator.run(engine.state.epoch)
File "/homes/Code/MONAI/monai/engines/evaluator.py", line 91, in run
super().run()
File "/homes/Code/MONAI/monai/engines/workflow.py", line 157, in run
super().run(data=self.data_loader, epoch_length=len(self.data_loader))
File "/homes/.miniconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 850, in run
return self._internal_run()
File "/homes/.miniconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 952, in _internal_run
self._handle_exception(e)
File "/homes/.miniconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 714, in _handle_exception
self._fire_event(Events.EXCEPTION_RAISED, e)
File "/homes/.miniconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 607, in _fire_event
func(self, *(event_args + args), **kwargs)
File "/homes/Code/MONAI/monai/handlers/stats_handler.py", line 145, in exception_raised
raise e
File "/homes/.miniconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 942, in _internal_run
self._fire_event(Events.EPOCH_COMPLETED)
File "/homes/.miniconda3/lib/python3.7/site-packages/ignite/engine/engine.py", line 607, in _fire_event
func(self, *(event_args + args), **kwargs)
File "/homes/Code/MONAI/monai/handlers/checkpoint_saver.py", line 195, in metrics_completed
self._key_metric_checkpoint(engine, self.save_dict)
File "/homes/.miniconda3/lib/python3.7/site-packages/ignite/handlers/checkpoint.py", line 407, in __call__
super(ModelCheckpoint, self).__call__(engine)
File "/homes/.miniconda3/lib/python3.7/site-packages/ignite/handlers/checkpoint.py", line 204, in __call__
self.save_handler.remove(item.filename)
File "/homes/.miniconda3/lib/python3.7/site-packages/ignite/handlers/checkpoint.py", line 285, in remove
os.remove(path)
FileNotFoundError: [Errno 2] No such file or directory: '/homes/Data/exp/test/Models/Best/net_key_metric=0.5.pth'
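The path in the log is built from the metric value alone (net_key_metric=0.5.pth). One plausible explanation, which the thread does not confirm, is that two epochs produced the same key metric, so the second save overwrote the first file while the saver still tracked two records. The sketch below simulates that scenario with a toy keep-best-n saver. ToyTopNSaver, filename_for and reproduce are hypothetical names invented for this illustration; this is not ignite's or MONAI's actual code.

```python
import os
import tempfile


def filename_for(score):
    # Assumed naming scheme, modelled on the path in the log:
    # the file name depends only on the metric value.
    return f"net_key_metric={score}.pth"


class ToyTopNSaver:
    """Toy keep-best-n saver for illustration; NOT ignite's implementation."""

    def __init__(self, dirname, n_saved):
        self.dirname = dirname
        self.n_saved = n_saved
        self.saved = []  # (score, path) records, worst score first

    def save(self, score):
        if len(self.saved) >= self.n_saved:
            if score <= self.saved[0][0]:
                return  # not better than the worst kept checkpoint
            _, worst_path = self.saved.pop(0)
            os.remove(worst_path)  # fails if the file was already deleted
        path = os.path.join(self.dirname, filename_for(score))
        with open(path, "w") as f:
            f.write("weights")
        self.saved.append((score, path))
        self.saved.sort()


def reproduce(scores, n_saved=4):
    """Return (file count on disk after each save, whether a removal failed)."""
    counts, failed = [], False
    with tempfile.TemporaryDirectory() as tmp:
        saver = ToyTopNSaver(tmp, n_saved)
        for score in scores:
            try:
                saver.save(score)
            except FileNotFoundError:
                failed = True
                break
            counts.append(len(os.listdir(tmp)))
    return counts, failed


# Two epochs share the score 0.5, so both records point at the same file.
print(reproduce([0.5, 0.5, 0.6, 0.7, 0.8, 0.9]))
```

In this toy model the duplicated score leaves fewer than four files on disk even though four records are tracked, and the second attempt to evict the 0.5 record raises FileNotFoundError. Both symptoms match the report, but the match does not prove this is the actual cause.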
Issue Analytics
- State:
- Created 3 years ago
- Comments:8 (5 by maintainers)
Top GitHub Comments
@vfdev-5 @Nic-Ma Thank you for your kind support. I just fixed the issue according to https://github.com/pytorch/ignite/pull/847 as a temporary solution. Looking forward to seeing the new MONAI.
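The thread does not show what the temporary fix looked like, and the snippet below is not taken from the linked pull request. Purely as an illustration of one defensive option, a removal helper can treat an already-missing checkpoint file as success instead of raising. remove_if_present is a hypothetical name.

```python
import contextlib
import os


def remove_if_present(path):
    """Delete path, treating an already-missing file as success."""
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)
```

A helper like this only hides the symptom. If two records really do point at the same file, the number of checkpoints on disk can still fall below key_metric_n_saved.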
Hi @ChenglongWang ,
Currently, MONAI only works with ignite v0.3, but we will try to update MONAI to be compatible with ignite v0.4.2 when it is released soon, because, as @vfdev-5 said, v0.4.2 will fully support distributed training features.
Thanks.