[Question] Resume from checkpoint clarification
Describe your question
I’m currently working on a transfer learning project for ASR and I followed the related tutorial. Since I’m not able to complete the entire training in a single session, I need to use checkpoints and resume from them. The steps I’m following are:
- Save the model after training:
quartznet.save_to("./out/my_model.nemo")
- Restore the model when resuming:
quartznet = nemo_asr.models.EncDecCTCModel.restore_from("./out/my_model.nemo")
- Restore the PyTorch Lightning trainer:
logger = TensorBoardLogger(
    save_dir=os.getcwd(),
    version=3,
    name='lightning_logs'
)
trainer = pl.Trainer(
    gpus=1, max_epochs=30, precision=16, amp_level='O1', checkpoint_callback=True,
    resume_from_checkpoint='lightning_logs/version_3/checkpoints/epoch=18-step=68893.ckpt',
    logger=logger
)
- Assign parameters and trainer to the model again:
# Set learning rate
params['model']['optim']['lr'] = 0.001
# Set NovoGrad optimizer betas
params['model']['optim']['betas'] = [0.95, 0.25]
# Set CosineAnnealing learning rate policy's warmup ratio
params['model']['optim']['sched']['warmup_ratio'] = 0.12
# Set training and validation labels
params['model']['train_ds']['labels'] = italian_vocabulary
params['model']['validation_ds']['labels'] = italian_vocabulary
# Set batch size
params['model']['train_ds']['batch_size'] = 16
params['model']['validation_ds']['batch_size'] = 16
# Assign trainer to the model
quartznet.set_trainer(trainer)
# Point to the data we'll use for fine-tuning as the training set
quartznet.setup_training_data(train_data_config=params['model']['train_ds'])
# Point to the new validation data for fine-tuning
quartznet.setup_validation_data(val_data_config=params['model']['validation_ds'])
# Add changes to quartznet model
quartznet.setup_optimization(optim_config=DictConfig(params['model']['optim']))
- Resume training:
trainer.fit(quartznet)
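As an optional sanity check (a sketch of my own, not part of the tutorial), the Lightning checkpoint can be inspected before resuming to confirm it actually carries the optimizer and scheduler state in addition to the weights (the path is the same one passed to resume_from_checkpoint above):
import torch

# Sketch: inspect the Lightning checkpoint that will be resumed from
ckpt = torch.load(
    'lightning_logs/version_3/checkpoints/epoch=18-step=68893.ckpt',
    map_location='cpu'
)
print(ckpt['epoch'], ckpt['global_step'])   # where the previous run stopped
print(list(ckpt.keys()))                    # should include 'state_dict', 'optimizer_states', 'lr_schedulers'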
I would like to know if this is the correct way to resume from checkpoint with the old trainer and model trained before, or if I’m missing some steps (or making redundant ones).
Before this I had tried to restore only the model and declare a new trainer (trainer = pl.Trainer(gpus=1, max_epochs=20, precision=16, amp_level='O1', checkpoint_callback=True)), then train for the same number of epochs as in the first training session, but the resulting WER was roughly the same, so I assumed the problem was that I hadn’t restored my old trainer too.
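For clarity, that earlier attempt looked roughly like this (a sketch reconstructed from the description above; only the weights stored in the .nemo file are restored, while the optimizer and LR schedule start from scratch with the new trainer):
# Sketch of the earlier attempt: restore only the model weights and
# create a brand-new trainer, so no optimizer/scheduler state is resumed
quartznet = nemo_asr.models.EncDecCTCModel.restore_from("./out/my_model.nemo")
trainer = pl.Trainer(gpus=1, max_epochs=20, precision=16, amp_level='O1',
                     checkpoint_callback=True)
quartznet.set_trainer(trainer)
trainer.fit(quartznet)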
Environment overview (please complete the following information)
- Environment location: Bare-metal
- Method of NeMo install:
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all]
Environment details
- OS version: Ubuntu 18.04.5 LTS
- PyTorch Lightning version: 1.1.5
- Python version: 3.6.9
Additional context
- GPU model: NVIDIA Quadro RTX 4000
- CUDA version: 10.1
This is a great question! Surprisingly, the QuartzNet config does not show two very nice parameters of the experiment manager that are used precisely for resuming training over multiple runs - https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/citrinet/citrinet_384.yaml#L381-L382
You can add them to the QuartzNet config manually, or use Hydra to add them with the +exp_manager.resume_if_exists=true and +exp_manager.resume_ignore_no_checkpoint=true flags. Note: when doing multiple such runs, give the experiment directory a unique name with exp_manager.name=<some unique experiment name>.
After the first run, simply rerun the training script with the same config / overrides and the same experiment manager flags - it will load up the model + its checkpoint + the optimizer and scheduler state automatically and continue training.
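If you prefer to wire this up in Python rather than through Hydra overrides, a minimal sketch would look like the following (the experiment name is just a placeholder, and trainer / quartznet are assumed to be defined as in your script):
# Sketch: configure NeMo's experiment manager so repeated runs of the same
# script automatically resume from the latest checkpoint of the experiment
from omegaconf import OmegaConf
from nemo.utils.exp_manager import exp_manager

exp_cfg = OmegaConf.create({
    'name': 'quartznet_italian_finetune',  # placeholder: give each experiment a unique name
    'resume_if_exists': True,              # resume from the latest checkpoint if one exists
    'resume_ignore_no_checkpoint': True,   # don't fail on the very first run, when no checkpoint exists yet
})

exp_manager(trainer, exp_cfg)  # sets up logging/checkpointing and points the trainer at the latest checkpoint
trainer.fit(quartznet)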
Many thanks, you have clarified all my doubts. I think we can close the issue.