question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

No longer able to load provided OPT checkpoint after recent changes

See original GitHub issue

🐛 Bug

No longer able to load provided OPT checkpoint after recent changes

To Reproduce

Edit metaseq/service/constants.py as before, in my case:

MAX_SEQ_LEN = 2048
BATCH_SIZE = 2048  # silly high bc we dynamically batch by MAX_BATCH_TOKENS
MAX_BATCH_TOKENS = 3072
DEFAULT_PORT = 6010
MODEL_PARALLEL = 1
TOTAL_WORLD_SIZE = 1
MAX_BEAM = 16

try:
    # internal logic denoting where checkpoints are in meta infrastructure
    from metaseq_internal.constants import CHECKPOINT_FOLDER
except ImportError:
    CHECKPOINT_FOLDER = "/home/jason_chou/redspot_home/350m/"
(...)

where

$ pwd
/home/jason_chou/redspot_home
$ ls 350m/
dict.txt  gpt2-merges.txt  gpt2-vocab.json  reshard.pt

and then run metaseq-api-local, but it no longer works:

$ metaseq-api-local
2022-10-05 22:19:25 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/350m/reshard.pt
2022-10-05 22:19:26 | INFO | metaseq.checkpoint_utils | Done reading from disk
Traceback (most recent call last):
  File "/home/jason_chou/.conda/envs/user/bin/metaseq-api-local", line 8, in <module>
    sys.exit(cli_main())
  File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 370, in cli_main
    distributed_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/home/default_user/metaseq/metaseq/distributed/utils.py", line 279, in call_main
    return main(cfg, **kwargs)
  File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 176, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/home/default_user/metaseq/metaseq/hub_utils.py", line 565, in load_model
    models, _model_args, _task = _load_checkpoint()
  File "/home/default_user/metaseq/metaseq/hub_utils.py", line 548, in _load_checkpoint
    return checkpoint_utils.load_model_ensemble_and_task(
  File "/home/default_user/metaseq/metaseq/checkpoint_utils.py", line 482, in load_model_ensemble_and_task
    model = build_model_hook(cfg, task)
  File "/home/default_user/metaseq/metaseq/hub_utils.py", line 538, in _build_model
    setattr(cfg["model"], "inference", True)
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 337, in __setattr__
    raise e
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 334, in __setattr__
    self.__set_impl(key, value)
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 318, in __set_impl
    self._set_item_impl(key, value)
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 511, in _set_item_impl
    self._validate_set(key, value)
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 180, in _validate_set
    target = self._get_node(key) if key is not None else self
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 465, in _get_node
    self._validate_get(key)
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 166, in _validate_get
    self._format_and_raise(
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/base.py", line 190, in _format_and_raise
    format_and_raise(
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/_utils.py", line 821, in format_and_raise
    _raise(ex, cause)
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/_utils.py", line 719, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ConfigAttributeError: Key 'inference' is not in struct
    full_key: model.inference
    object_type=dict

Apparently this can be traced back to when setattr(cfg["model"], "inference", True) was added (https://github.com/facebookresearch/metaseq/pull/356). However, another issue surfaced even with that line commented out:

$ metaseq-api-local
2022-10-05 22:23:31 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/350m/reshard.pt
2022-10-05 22:23:31 | INFO | metaseq.checkpoint_utils | Done reading from disk
Traceback (most recent call last):
  File "/home/jason_chou/.conda/envs/user/bin/metaseq-api-local", line 8, in <module>
    sys.exit(cli_main())
  File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 370, in cli_main
    distributed_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/home/default_user/metaseq/metaseq/distributed/utils.py", line 279, in call_main
    return main(cfg, **kwargs)
  File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 176, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/home/default_user/metaseq/metaseq/hub_utils.py", line 565, in load_model
    models, _model_args, _task = _load_checkpoint()
  File "/home/default_user/metaseq/metaseq/hub_utils.py", line 548, in _load_checkpoint
    return checkpoint_utils.load_model_ensemble_and_task(
  File "/home/default_user/metaseq/metaseq/checkpoint_utils.py", line 487, in load_model_ensemble_and_task
    model.load_state_dict(state["model"], strict=strict)
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerLanguageModel:
        Missing key(s) in state_dict: "decoder.layer_norm.weight", "decoder.layer_norm.bias". 

which seems to be due to recent cleanup PRs (https://github.com/facebookresearch/metaseq/pull/366, https://github.com/facebookresearch/metaseq/pull/380, https://github.com/facebookresearch/metaseq/pull/381).

Expected behavior

metaseq-api-local up & running

Environment

  • metaseq Version: latest main (7828d72815a9a581ab47b95876d38cb262741883)
  • PyTorch Version: 1.12.1+cu113
  • OS (e.g., Linux): Ubuntu 18.04.6 LTS
  • How you installed metaseq: pip
  • Python version: 3.10.4
  • CUDA/cuDNN version: CUDA 11.8
  • GPU models and configuration: 1 x T4

Issue Analytics

  • State:open
  • Created a year ago
  • Reactions:1
  • Comments:11 (7 by maintainers)

github_iconTop GitHub Comments

2reactions
ruanslvcommented, Oct 6, 2022

Ok did a bit of digging with @suchenzang, here is the summary:

Suggested actions: 1/ Put up a fix for first problem 2/ Keep code related to second issue as is, and instead retrain 350M model with layer norms 3/ Merge code-paths with and without model parallel to avoid similar problems in the future

2reactions
ruanslvcommented, Oct 6, 2022

After commenting out line suggested, second error is caused by this commit in particular https://github.com/facebookresearch/metaseq/commit/c4b33ba6e2cd9b33539bbb5a35d831096bde3282

Read more comments on GitHub >

github_iconTop Results From Across the Web

Solved: /opt partition filling up - Check Point CheckMates
Solved: Hi, The /opt partition is getting filled up. What could be eating up the space? Anything that can be deleted?
Read more >
How to troubleshoot Gaia Portal (WebUI)
After the name of the Security Gateway is changed and SIC is reset with the Management server, there is a certificate error and...
Read more >
Check Point R81.10 Known Limitations
DP-7857, Installing policy immediately after the gateway is upgraded might fail if the Threat Prevention Policy is applicable. Refer to sk174151 ...
Read more >
Gaia Limitations after Snapshot Recovery
All packages that were uploaded with SmartUpdate to the Security Management Server before reverting are invalid after reverting. To fix this, ...
Read more >
High CPU / process crashes / timeout due to large database ...
Check Point recommends to always upgrade to the most recent version. For lower / other versions, Check Point can supply a Hotfix. Contact...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found