Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[Question] Resume from checkpoint clarification

See original GitHub issue

Describe your question

I’m currently working on a transfer learning project for ASR, following the related tutorial. Since I’m not able to complete an entire training run in a single session, I need to use checkpoints and resume from them. The steps I’m following are:

  1. Save the model after training:
quartznet.save_to("./out/my_model.nemo")
  2. Restore the model when resuming:
import nemo.collections.asr as nemo_asr

quartznet = nemo_asr.models.EncDecCTCModel.restore_from("./out/my_model.nemo")
  3. Restore the pytorch-lightning trainer:
import os
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

logger = TensorBoardLogger(
    save_dir=os.getcwd(),
    version=3,
    name='lightning_logs'
)
trainer = pl.Trainer(
    gpus=1, max_epochs=30, precision=16, amp_level='O1', checkpoint_callback=True,
    resume_from_checkpoint='lightning_logs/version_3/checkpoints/epoch=18-step=68893.ckpt',
    logger=logger
)
  4. Assign parameters and trainer to the model again:
from omegaconf import DictConfig

# 'params' holds the model configuration loaded earlier from the QuartzNet YAML file
# Set learning rate
params['model']['optim']['lr'] = 0.001
# Set NovoGrad optimizer betas
params['model']['optim']['betas'] = [0.95, 0.25]
# Set CosineAnnealing learning rate policy's warmup ratio
params['model']['optim']['sched']['warmup_ratio'] = 0.12
# Set training and validation labels
params['model']['train_ds']['labels'] = italian_vocabulary
params['model']['validation_ds']['labels'] = italian_vocabulary
# Set batch size
params['model']['train_ds']['batch_size'] = 16
params['model']['validation_ds']['batch_size'] = 16

# Assign trainer to the model
quartznet.set_trainer(trainer)

# Point to the data we'll use for fine-tuning as the training set
quartznet.setup_training_data(train_data_config=params['model']['train_ds'])

# Point to the new validation data for fine-tuning
quartznet.setup_validation_data(val_data_config=params['model']['validation_ds'])

# Add changes to quartznet model
quartznet.setup_optimization(optim_config=DictConfig(params['model']['optim']))
  5. Resume training:
trainer.fit(quartznet)

I would like to know whether this is the correct way to resume from a checkpoint with the previously trained model and trainer, or whether I’m missing some steps (or making redundant ones). Previously I tried restoring only the model and declaring a new trainer (trainer = pl.Trainer(gpus=1, max_epochs=20, precision=16, amp_level='O1', checkpoint_callback=True)), then training for the same number of epochs as the first session; the resulting WER was roughly the same, so I assumed the problem was that I hadn’t restored my old trainer as well.
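
For context: a .nemo file saved with save_to() packages the model weights and config, while the Lightning .ckpt additionally carries the optimizer and scheduler state that a real resume needs. A minimal sketch to verify that, assuming the checkpoint path from the snippet above exists locally:

import torch

# Load the Lightning checkpoint on CPU purely to inspect its contents
ckpt = torch.load(
    'lightning_logs/version_3/checkpoints/epoch=18-step=68893.ckpt',
    map_location='cpu'
)
print(ckpt['epoch'], ckpt['global_step'])  # training progress at save time
print(list(ckpt.keys()))  # expect 'state_dict', 'optimizer_states', 'lr_schedulers', ...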

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of NeMo install: pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all]

Environment details

  • OS version: Ubuntu 18.04.5 LTS
  • PyTorch Lightning version: 1.1.5
  • Python version: 3.6.9

Additional context

  • GPU model: NVIDIA Quadro RTX 4000
  • CUDA version: 10.1

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9

Top GitHub Comments

2 reactions
titu1994 commented on Feb 16, 2021

This is a great question! Surprisingly, the QuartzNet config does not expose two very useful experiment manager parameters that exist precisely for resuming training across multiple runs: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/citrinet/citrinet_384.yaml#L381-L382

You can add them to the quartznet config manually, or use hydra to add them with the +exp_manager.resume_if_exists=true and +exp_manager.resume_ignore_no_checkpoint=true flags. Note: when doing multiple such runs, give the experiment directory a unique name with exp_manager.name=<some unique experiment name>.
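
For example, a run launched through one of the NeMo example training scripts could pass those overrides like this (the script, config, and experiment names here are placeholders, not taken from this issue; only the exp_manager flags are the ones above):

python speech_to_text.py \
    --config-path=conf --config-name=quartznet_15x5 \
    exp_manager.name=quartznet_finetune_it \
    +exp_manager.resume_if_exists=true \
    +exp_manager.resume_ignore_no_checkpoint=true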

After the first run, simply rerun the training script with the same config/overrides and the same experiment manager flags, and it will automatically load the model checkpoint, together with the optimizer and scheduler state, and continue training.
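
For a notebook workflow like the one in the question, the same mechanism can be invoked programmatically through nemo.utils.exp_manager. A minimal sketch, assuming a placeholder experiment name and the trainer settings from the question:

import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.utils.exp_manager import exp_manager

# Let exp_manager create the logger and checkpoint callback itself
trainer = pl.Trainer(gpus=1, max_epochs=30, precision=16, amp_level='O1',
                     logger=False, checkpoint_callback=False)

exp_cfg = OmegaConf.create({
    'name': 'quartznet_finetune_it',      # keep this identical across runs
    'resume_if_exists': True,             # pick up the latest checkpoint, if any
    'resume_ignore_no_checkpoint': True,  # first run has none: don't fail
})
exp_manager(trainer, exp_cfg)

trainer.fit(quartznet)  # 'quartznet' restored as in the question above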

0 reactions
lucalazzaroni commented on Feb 18, 2021

Many thanks, you have clarified all my doubts. I think we can close the issue.
