
Add interrupt/terminate option to Trainers & Evaluators


Is your feature request related to a problem? Please describe. It should be possible to abort/finalize a running Trainer by calling an API (rather than pressing Ctrl+C). This would be helpful when the Trainer needs to be executed remotely, such as in federated learning (FL) scenarios.

Describe the solution you’d like Add abort() and finalize() functions to the Trainer class (or potentially its base class). Note that finalize() should terminate the training completely, while abort() should allow training to continue later from where it was aborted by calling run() again.

For example, an Ignite-based Trainer supporting abort() and finalize() calls could be implemented as follows (currently used in MONAI-FL’s MonaiAlgo class; private repo - contact me if you need access):

    # assumes `from ignite.engine import Events` and that self.trainer is an Ignite Engine
    def abort(self):
        # stop the engine after the current iteration
        self.trainer.terminate()
        # save the current dataloader iterator so the next run() call resumes from here
        setattr(self.trainer.state, "dataloader_iter", self.trainer._dataloader_iter)

        if self.trainer.state.iteration % self.trainer.state.epoch_length == 0:
            # if the current iteration ends an epoch, manually trigger the epoch-completed event
            self.trainer._fire_event(Events.EPOCH_COMPLETED)

    def finalize(self):
        # terminate training completely; it cannot be resumed afterwards
        self.trainer.terminate()
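
To make the intended semantics concrete, here is a minimal, framework-free sketch (illustrative names only, not MONAI or Ignite code): abort() pauses training so that a later run() call resumes from the same step, while finalize() stops it permanently.

class ToyTrainer:
    def __init__(self, num_steps):
        self.num_steps = num_steps
        self.step = 0
        self._aborted = False
        self._finalized = False

    def run(self):
        if self._finalized:
            return
        self._aborted = False
        while self.step < self.num_steps and not self._aborted:
            self.step += 1                   # one training "iteration"
            print("step", self.step)
            if self.step == 2:               # pretend a remote abort request arrives here
                self.abort()

    def abort(self):
        # stop the loop but keep self.step, so the next run() resumes where we left off
        self._aborted = True

    def finalize(self):
        # stop training permanently; further run() calls do nothing
        self._finalized = True

trainer = ToyTrainer(num_steps=4)
trainer.run()       # prints steps 1-2, then aborts
trainer.run()       # resumes and prints steps 3-4
trainer.finalize()
trainer.run()       # no-op after finalize()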

Describe alternatives you’ve considered n/a

Additional context n/a

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 29 (24 by maintainers)

Top GitHub Comments

1 reaction
vfdev-5 commented, Aug 23, 2022

Hi @Nic-Ma

We will try to land the code this week; it will then be available on master and in the nightly release. Next, we’ll schedule our regular 0.4.10 release, which will include this feature, provided your tests against the nightly build confirm that we are good.

Thanks

1 reaction
vfdev-5 commented, Aug 17, 2022

@Nic-Ma allocated memory can be released (the CUDA context memory maybe not):

import torch

# nothing allocated on the GPU yet
print("-", torch.cuda.memory_allocated() / 1024 / 1024)

# allocate a ~382 MB float32 tensor
x = torch.rand(100, 100, 100, 100, device="cuda")
print("--", torch.cuda.memory_allocated() / 1024 / 1024)

# drop the only reference; the allocator releases the memory
x = None
print("---", torch.cuda.memory_allocated() / 1024 / 1024)

Output:

- 0.0
-- 382.0
--- 0.0
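
Note that the caching allocator may still hold the freed blocks as reserved memory even though allocated memory is back to zero. A small follow-up sketch (using torch.cuda.memory_reserved() and torch.cuda.empty_cache(); exact numbers will vary by setup) shows how to inspect and release that cache; the CUDA context itself is only released when the process exits.

import torch

x = torch.rand(100, 100, 100, 100, device="cuda")
x = None

# allocated memory is back to 0, but the caching allocator may still hold the blocks as reserved
print("allocated:", torch.cuda.memory_allocated() / 1024 / 1024)
print("reserved:", torch.cuda.memory_reserved() / 1024 / 1024)

# return cached blocks to the driver; the CUDA context itself is only freed on process exit
torch.cuda.empty_cache()
print("reserved after empty_cache():", torch.cuda.memory_reserved() / 1024 / 1024)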