
Add interrupt/terminate option to Trainers & Evaluators


Is your feature request related to a problem? Please describe. It should be possible to abort/finalize a running Trainer by calling an API (rather than pressing Ctrl+C). This would be helpful when the Trainer needs to be executed remotely, such as in federated learning (FL) scenarios.

Describe the solution you’d like Add abort() and finalize() functions to the Trainer class (or potentially its base class). Note that finalize() should terminate the training completely, while abort() should allow training to continue later from where it was aborted by calling run() again.

For example, an Ignite-based Trainer supporting abort() and finalize() calls could be implemented as follows (currently used in MONAI-FL’s MonaiAlgo class; private repo - contact me if you need access):

    # assumes `from ignite.engine import Events` and that self.trainer is an Ignite Engine
    def abort(self):
        # stop the engine after the current iteration
        self.trainer.terminate()
        # save the current dataloader iterator so the next run() call resumes from here
        setattr(self.trainer.state, "dataloader_iter", self.trainer._dataloader_iter)

        if self.trainer.state.iteration % self.trainer.state.epoch_length == 0:
            # if the current iteration ends an epoch, manually trigger the epoch-completed event
            self.trainer._fire_event(Events.EPOCH_COMPLETED)

    def finalize(self):
        # terminate training completely; it cannot be resumed afterwards
        self.trainer.terminate()
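
To make the intended semantics concrete, here is a minimal, framework-free sketch (illustrative names only, not MONAI or Ignite code): abort() pauses training so that a later run() call resumes from the same step, while finalize() stops it permanently.

class ToyTrainer:
    def __init__(self, num_steps):
        self.num_steps = num_steps
        self.step = 0
        self._aborted = False
        self._finalized = False

    def run(self):
        if self._finalized:
            return
        self._aborted = False
        while self.step < self.num_steps and not self._aborted:
            self.step += 1                   # one training "iteration"
            print("step", self.step)
            if self.step == 2:               # pretend a remote abort request arrives here
                self.abort()

    def abort(self):
        # stop the loop but keep self.step, so the next run() resumes where we left off
        self._aborted = True

    def finalize(self):
        # stop training permanently; further run() calls do nothing
        self._finalized = True

trainer = ToyTrainer(num_steps=4)
trainer.run()       # prints steps 1-2, then aborts
trainer.run()       # resumes and prints steps 3-4
trainer.finalize()
trainer.run()       # no-op after finalize()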

Describe alternatives you’ve considered n/a

Additional context n/a

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 29 (24 by maintainers)

Top GitHub Comments

1 reaction
vfdev-5 commented, Aug 23, 2022

Hi @Nic-Ma

We will try to land the code this week; it will then be available on master and in the nightly release. Next, we’ll schedule our regular 0.4.10 release, which will include this feature, provided your tests against the nightly build confirm that we are good.

Thanks

1 reaction
vfdev-5 commented, Aug 17, 2022

@Nic-Ma allocated memory can be released (the CUDA context memory maybe not):

import torch

# nothing allocated on the GPU yet
print("-", torch.cuda.memory_allocated() / 1024 / 1024)

# allocate a ~382 MB float32 tensor
x = torch.rand(100, 100, 100, 100, device="cuda")
print("--", torch.cuda.memory_allocated() / 1024 / 1024)

# drop the only reference; the allocator releases the memory
x = None
print("---", torch.cuda.memory_allocated() / 1024 / 1024)

Output:

- 0.0
-- 382.0
--- 0.0
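
Note that the caching allocator may still hold the freed blocks as reserved memory even though allocated memory is back to zero. A small follow-up sketch (using torch.cuda.memory_reserved() and torch.cuda.empty_cache(); exact numbers will vary by setup) shows how to inspect and release that cache; the CUDA context itself is only released when the process exits.

import torch

x = torch.rand(100, 100, 100, 100, device="cuda")
x = None

# allocated memory is back to 0, but the caching allocator may still hold the blocks as reserved
print("allocated:", torch.cuda.memory_allocated() / 1024 / 1024)
print("reserved:", torch.cuda.memory_reserved() / 1024 / 1024)

# return cached blocks to the driver; the CUDA context itself is only freed on process exit
torch.cuda.empty_cache()
print("reserved after empty_cache():", torch.cuda.memory_reserved() / 1024 / 1024)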