
from_pretrained() does not update configuration in exp_manager

See original GitHub issue

Describe the bug

When fine-tuning from a pretrained NeMo model (e.g. stt_en_cn1024), the exp_manager’s cfg is not updated properly: I can see that my run trains the model with one config, but WandB reports another.

This issue did not occur in v1.4.0 and appeared after I upgraded to v1.5.0. Maybe it has to do with the order of operations? See below.

Steps/Code to reproduce bug

import pytorch_lightning as pl
from nemo.collections.asr.models import EncDecCTCModelBPE
from nemo.core.config import hydra_runner
from nemo.utils.exp_manager import exp_manager

@hydra_runner(config_path="conf/citrinet/", config_name="config")
def main(cfg):
    trainer = pl.Trainer(**cfg.trainer)
    log_dir = exp_manager(trainer, cfg.get("exp_manager", None))
    # Restore the pretrained checkpoint, then adapt it to the new data
    asr_model = EncDecCTCModelBPE.from_pretrained(model_name=cfg.init_from_pretrained_model)
    asr_model.encoder.unfreeze()
    asr_model.change_vocabulary(
        new_tokenizer_dir=cfg.model.tokenizer.dir,
        new_tokenizer_type=cfg.model.tokenizer.type,
    )
    asr_model.setup_optimization(cfg.model.optim)
    asr_model.setup_training_data(cfg.model.train_ds)
    asr_model.setup_multiple_validation_data(cfg.model.validation_ds)
    asr_model.spec_augmentation = asr_model.from_config_dict(cfg.model.spec_augment)
    asr_model.set_trainer(trainer)
    trainer.fit(asr_model)

if __name__ == "__main__":
    main()

Expected behavior

The WandB cfg should display the proper config (a Pastebin of the reported WandB config was linked in the original issue).

Environment overview (please complete the following information)

  • Environment location: Docker (nvcr.io/nvidia/pytorch:21.10-py3) on AWS EC2 using docker run -it <image> bash
  • Method of NeMo install: pip install nemo_toolkit[asr]==1.5.1

Additional context

  • GPU model: V100
  • Nvidia driver: 460

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 16 (9 by maintainers)

Top GitHub Comments

3 reactions
titu1994 commented, Jan 28, 2022

Thanks for digging into this! This is a very interesting “bug” - it affects cases where you load a model to fine-tune, modify the config, then save and retrain. It’s not an invalid use case; we should at least notify the PTL team of this.

On the NeMo side I’ll patch up a solution for PTL 1.6 (inside the model.cfg setter) and await PTL’s official solution for the NeMo 1.7 release.

Metric setup is kinda niche - you’ll normally never modify metrics. ASR is unusual in that the domain can change quite drastically (English to Mandarin, for example), so of course the metric needs to change entirely (WER to CER). NLP/NMT has a universal metric like accuracy or BLEU which works without modification. I’ll discuss with the ASR team whether they think a specialized setup_metric() would be worth it. Thanks for the feedback!

2 reactions
titu1994 commented, Dec 24, 2021

Oh, I think I see the issue: the config passed to the main method is a new config, but the model is restored with the old config embedded in the .nemo file. You should be able to do model.cfg = cfg before your train step to correctly update the internal config of the model. That’s a good catch - I believe only one of the notebooks shows this. Might as well add it to the documentation.

Still, the trainer should be set before calling any other function, or passed into the restore_from or from_pretrained call itself.
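A toy sketch of that suggestion (plain-Python stand-ins only - ToyModel and the dict configs below are hypothetical, not NeMo’s actual classes): the restored model keeps whatever config was embedded in the checkpoint, and the logger keeps reporting stale values until you assign the current run’s config back onto the model before training.

```python
class ToyModel:
    """Mimics a model whose cfg comes from the checkpoint at restore time."""

    def __init__(self, embedded_cfg):
        self.cfg = embedded_cfg

    @classmethod
    def from_pretrained(cls, checkpoint):
        # The restored model carries the OLD config embedded in the file,
        # not whatever config the training script was launched with.
        return cls(checkpoint["cfg"])


checkpoint = {"cfg": {"optim": {"lr": 0.05}}}     # config saved inside the checkpoint
hydra_cfg = {"model": {"optim": {"lr": 0.001}}}   # config passed on this run

model = ToyModel.from_pretrained(checkpoint)
assert model.cfg["optim"]["lr"] == 0.05           # stale: embedded config wins

# The suggested fix: overwrite the internal config before training,
# so loggers such as WandB report the values actually used this run.
model.cfg = hydra_cfg["model"]
assert model.cfg["optim"]["lr"] == 0.001
```

In real NeMo code the equivalent assignment is the `model.cfg = cfg` mentioned above (done before `trainer.fit`), with the trainer set first or passed directly to the restore call.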


