Add interrupt/terminate option to Trainers & Evaluators
Is your feature request related to a problem? Please describe. It should be possible to abort or finalize a running Trainer through an API call (rather than with Ctrl+C). This would be helpful when the Trainer is executed remotely, such as in federated learning (FL) scenarios.
Describe the solution you’d like Add abort() and finalize() functions to the Trainer class (or potentially its base class). Note that finalize() should terminate the training completely, while abort() should allow training to continue later from where it was aborted, by calling run() again.
For example, an ignite-based Trainer supporting abort() and finalize() calls could be implemented as follows (currently used in MONAI-FL’s MonaiAlgo class; private repo, contact me if you need access):
from ignite.engine import Events

# methods of the class that wraps the ignite-based trainer
def abort(self):
    self.trainer.terminate()
    # save current iteration for next round
    setattr(self.trainer.state, "dataloader_iter", self.trainer._dataloader_iter)
    if self.trainer.state.iteration % self.trainer.state.epoch_length == 0:
        # if current iteration is end of 1 epoch, manually trigger epoch completed event
        self.trainer._fire_event(Events.EPOCH_COMPLETED)

def finalize(self):
    self.trainer.terminate()
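To make the requested semantics concrete, here is a minimal, library-free sketch of the difference between abort() and finalize(). The ToyTrainer class, its attributes, and the step callback are all hypothetical and exist only for illustration; this is not the MONAI or ignite API, and a real implementation would call abort() from another thread or process rather than from inside the step function.

```python
from typing import Callable, List, Optional


class ToyTrainer:
    """Hypothetical stand-in for a Trainer (not MONAI's or ignite's API)."""

    def __init__(self, data: List[int], max_epochs: int,
                 step: Callable[[int], None]) -> None:
        self.data = data
        self.max_epochs = max_epochs
        self.step = step
        self.epoch = 0        # number of completed epochs
        self.iteration = 0    # number of completed iterations overall
        self.finished = False
        self._stop: Optional[str] = None  # None, "abort", or "finalize"

    def abort(self) -> None:
        # Pause after the current iteration; a later run() resumes from there.
        self._stop = "abort"

    def finalize(self) -> None:
        # Stop after the current iteration; later run() calls do nothing.
        self._stop = "finalize"

    def run(self) -> None:
        if self.finished:
            return
        self._stop = None
        n = len(self.data)
        while self.epoch < self.max_epochs:
            # The position inside the epoch is derived from the saved iteration.
            self.step(self.data[self.iteration % n])
            self.iteration += 1
            if self.iteration % n == 0:
                # Close the epoch before honouring a stop request, similar in
                # spirit to firing EPOCH_COMPLETED manually in the issue's code.
                self.epoch += 1
            if self._stop == "finalize":
                self.finished = True
                return
            if self._stop == "abort":
                return
        self.finished = True


seen: List[int] = []
trainer = ToyTrainer(data=[0, 1, 2, 3], max_epochs=2, step=lambda batch: None)


def step(batch: int) -> None:
    seen.append(batch)
    if len(seen) == 3:
        trainer.abort()  # simulates a remote abort request mid-epoch


trainer.step = step
trainer.run()            # stops after 3 iterations, epoch 0 still open
paused_at = (trainer.epoch, trainer.iteration)
trainer.run()            # resumes with batch 3 and completes both epochs
```

After the first run() the trainer is paused at iteration 3 of epoch 0; the second run() picks up with the fourth batch instead of restarting the epoch. Calling finalize() instead of abort() would mark the trainer as finished, so a second run() would return immediately.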
Describe alternatives you’ve considered n/a
Additional context n/a
Issue Analytics
- State:
- Created a year ago
- Reactions:1
- Comments:29 (24 by maintainers)
Top GitHub Comments
Hi @Nic-Ma
We will try to land the code this week, so it will be available on master and in the nightly release. Next, we’ll schedule our regular 0.4.10 release, which will include this feature if your tests on the nightly confirm that everything works.
Thanks
@Nic-Ma allocated memory can be released (context memory maybe not):