[Question] Resume from checkpoint clarification
Describe your question
I’m currently working on a transfer learning project for ASR and I followed the related tutorial. Since I’m not able to complete the entire training in a single session, I need to use checkpoints and resume from them. The steps I’m following are:
- Save the model after training:
quartznet.save_to("./out/my_model.nemo")
- Restore the model when resuming:
quartznet = nemo_asr.models.EncDecCTCModel.restore_from("./out/my_model.nemo")
- Restore the PyTorch Lightning trainer:
logger = TensorBoardLogger(
    save_dir=os.getcwd(),
    version=3,
    name='lightning_logs'
)
trainer = pl.Trainer(
    gpus=1, max_epochs=30, precision=16, amp_level='O1', checkpoint_callback=True,
    resume_from_checkpoint='lightning_logs/version_3/checkpoints/epoch=18-step=68893.ckpt',
    logger=logger
)
- Assign parameters and trainer to the model again:
# Set learning rate
params['model']['optim']['lr'] = 0.001
# Set NovoGrad optimizer betas
params['model']['optim']['betas'] = [0.95, 0.25]
# Set CosineAnnealing learning rate policy's warmup ratio
params['model']['optim']['sched']['warmup_ratio'] = 0.12
# Set training and validation labels
params['model']['train_ds']['labels'] = italian_vocabulary
params['model']['validation_ds']['labels'] = italian_vocabulary
# Set batch size
params['model']['train_ds']['batch_size'] = 16
params['model']['validation_ds']['batch_size'] = 16
# Assign trainer to the model
quartznet.set_trainer(trainer)
# Point to the data we'll use for fine-tuning as the training set
quartznet.setup_training_data(train_data_config=params['model']['train_ds'])
# Point to the new validation data for fine-tuning
quartznet.setup_validation_data(val_data_config=params['model']['validation_ds'])
# Add changes to quartznet model
quartznet.setup_optimization(optim_config=DictConfig(params['model']['optim']))
- Resume training:
trainer.fit(quartznet)
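As an optional sanity check (a sketch of my own, not part of the tutorial), the Lightning checkpoint can be inspected before resuming to confirm it actually carries the optimizer and scheduler state in addition to the weights (the path is the same one passed to resume_from_checkpoint above):
import torch

# Sketch: inspect the Lightning checkpoint that will be resumed from
ckpt = torch.load(
    'lightning_logs/version_3/checkpoints/epoch=18-step=68893.ckpt',
    map_location='cpu'
)
print(ckpt['epoch'], ckpt['global_step'])   # where the previous run stopped
print(list(ckpt.keys()))                    # should include 'state_dict', 'optimizer_states', 'lr_schedulers'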
I would like to know if this is the correct way to resume from checkpoint with the old trainer and model trained before, or if I’m missing some steps (or making redundant ones).
Before this I had tried to restore only the model and declare a new trainer (trainer = pl.Trainer(gpus=1, max_epochs=20, precision=16, amp_level='O1', checkpoint_callback=True)), then train for the same number of epochs as in the first training session, but the resulting WER was roughly the same, so I assumed the problem was that I hadn’t restored my old trainer too.
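For clarity, that earlier attempt looked roughly like this (a sketch reconstructed from the description above; only the weights stored in the .nemo file are restored, while the optimizer and LR schedule start from scratch with the new trainer):
# Sketch of the earlier attempt: restore only the model weights and
# create a brand-new trainer, so no optimizer/scheduler state is resumed
quartznet = nemo_asr.models.EncDecCTCModel.restore_from("./out/my_model.nemo")
trainer = pl.Trainer(gpus=1, max_epochs=20, precision=16, amp_level='O1',
                     checkpoint_callback=True)
quartznet.set_trainer(trainer)
trainer.fit(quartznet)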
Environment overview (please complete the following information)
- Environment location: Bare-metal
- Method of NeMo install:
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all]
Environment details
- OS version: Ubuntu 18.04.5 LTS
- PyTorch Lightning version: 1.1.5
- Python version: 3.6.9
Additional context
- GPU model: NVIDIA Quadro RTX 4000
- CUDA version: 10.1
This is a great question! Surprisingly, the QuartzNet config does not show two very nice parameters of the experiment manager that are used precisely for resuming training over multiple runs - https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/citrinet/citrinet_384.yaml#L381-L382
You can add them to the QuartzNet config manually, or use Hydra to add them with the +exp_manager.resume_if_exists=true and +exp_manager.resume_ignore_no_checkpoint=true flags. Note: when doing multiple such runs, give the experiment directory a unique name with exp_manager.name=<some unique experiment name>.
After the first run, simply rerun the training script with the same config / overrides and the same experiment manager flags - it will load up the model + its checkpoint + the optimizer and scheduler state automatically and continue training.
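If you prefer to wire this up in Python rather than through Hydra overrides, a minimal sketch would look like the following (the experiment name is just a placeholder, and trainer / quartznet are assumed to be defined as in your script):
# Sketch: configure NeMo's experiment manager so repeated runs of the same
# script automatically resume from the latest checkpoint of the experiment
from omegaconf import OmegaConf
from nemo.utils.exp_manager import exp_manager

exp_cfg = OmegaConf.create({
    'name': 'quartznet_italian_finetune',  # placeholder: give each experiment a unique name
    'resume_if_exists': True,              # resume from the latest checkpoint if one exists
    'resume_ignore_no_checkpoint': True,   # don't fail on the very first run, when no checkpoint exists yet
})

exp_manager(trainer, exp_cfg)  # sets up logging/checkpointing and points the trainer at the latest checkpoint
trainer.fit(quartznet)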
Many thanks, you have clarified all my doubts. I think we can close the issue.