No longer able to load provided OPT checkpoint after recent changes
🐛 Bug
No longer able to load provided OPT checkpoint after recent changes
To Reproduce
Edit metaseq/service/constants.py as before; in my case:
MAX_SEQ_LEN = 2048
BATCH_SIZE = 2048 # silly high bc we dynamically batch by MAX_BATCH_TOKENS
MAX_BATCH_TOKENS = 3072
DEFAULT_PORT = 6010
MODEL_PARALLEL = 1
TOTAL_WORLD_SIZE = 1
MAX_BEAM = 16
try:
# internal logic denoting where checkpoints are in meta infrastructure
from metaseq_internal.constants import CHECKPOINT_FOLDER
except ImportError:
CHECKPOINT_FOLDER = "/home/jason_chou/redspot_home/350m/"
(...)
where
$ pwd
/home/jason_chou/redspot_home
$ ls 350m/
dict.txt gpt2-merges.txt gpt2-vocab.json reshard.pt
and then run metaseq-api-local, but it no longer works:
$ metaseq-api-local
2022-10-05 22:19:25 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/350m/reshard.pt
2022-10-05 22:19:26 | INFO | metaseq.checkpoint_utils | Done reading from disk
Traceback (most recent call last):
File "/home/jason_chou/.conda/envs/user/bin/metaseq-api-local", line 8, in <module>
sys.exit(cli_main())
File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 370, in cli_main
distributed_utils.call_main(cfg, worker_main, namespace_args=args)
File "/home/default_user/metaseq/metaseq/distributed/utils.py", line 279, in call_main
return main(cfg, **kwargs)
File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 176, in worker_main
models = generator.load_model() # noqa: F841
File "/home/default_user/metaseq/metaseq/hub_utils.py", line 565, in load_model
models, _model_args, _task = _load_checkpoint()
File "/home/default_user/metaseq/metaseq/hub_utils.py", line 548, in _load_checkpoint
return checkpoint_utils.load_model_ensemble_and_task(
File "/home/default_user/metaseq/metaseq/checkpoint_utils.py", line 482, in load_model_ensemble_and_task
model = build_model_hook(cfg, task)
File "/home/default_user/metaseq/metaseq/hub_utils.py", line 538, in _build_model
setattr(cfg["model"], "inference", True)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 337, in __setattr__
raise e
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 334, in __setattr__
self.__set_impl(key, value)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 318, in __set_impl
self._set_item_impl(key, value)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 511, in _set_item_impl
self._validate_set(key, value)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 180, in _validate_set
target = self._get_node(key) if key is not None else self
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 465, in _get_node
self._validate_get(key)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 166, in _validate_get
self._format_and_raise(
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/base.py", line 190, in _format_and_raise
format_and_raise(
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/_utils.py", line 821, in format_and_raise
_raise(ex, cause)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/_utils.py", line 719, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ConfigAttributeError: Key 'inference' is not in struct
full_key: model.inference
object_type=dict
Apparently this can be traced back to when setattr(cfg["model"], "inference", True)
was added (https://github.com/facebookresearch/metaseq/pull/356). However, another issue surfaced even with that line commented out:
$ metaseq-api-local
2022-10-05 22:23:31 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/350m/reshard.pt
2022-10-05 22:23:31 | INFO | metaseq.checkpoint_utils | Done reading from disk
Traceback (most recent call last):
File "/home/jason_chou/.conda/envs/user/bin/metaseq-api-local", line 8, in <module>
sys.exit(cli_main())
File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 370, in cli_main
distributed_utils.call_main(cfg, worker_main, namespace_args=args)
File "/home/default_user/metaseq/metaseq/distributed/utils.py", line 279, in call_main
return main(cfg, **kwargs)
File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 176, in worker_main
models = generator.load_model() # noqa: F841
File "/home/default_user/metaseq/metaseq/hub_utils.py", line 565, in load_model
models, _model_args, _task = _load_checkpoint()
File "/home/default_user/metaseq/metaseq/hub_utils.py", line 548, in _load_checkpoint
return checkpoint_utils.load_model_ensemble_and_task(
File "/home/default_user/metaseq/metaseq/checkpoint_utils.py", line 487, in load_model_ensemble_and_task
model.load_state_dict(state["model"], strict=strict)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerLanguageModel:
Missing key(s) in state_dict: "decoder.layer_norm.weight", "decoder.layer_norm.bias".
which seems to be due to recent cleanup PRs (https://github.com/facebookresearch/metaseq/pull/366, https://github.com/facebookresearch/metaseq/pull/380, https://github.com/facebookresearch/metaseq/pull/381).
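The second failure is a plain strict state-dict mismatch: the checkpoint was saved without the final decoder layer norm, while the rebuilt model now expects those parameters. A toy PyTorch reproduction of the same behavior (module names here are illustrative, not metaseq's actual classes):

```python
import torch.nn as nn

# The "checkpoint" model lacks a layer norm that the rebuilt model expects,
# mirroring the missing "decoder.layer_norm.weight"/".bias" keys above.
class WithLN(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)
        self.layer_norm = nn.LayerNorm(4)

class WithoutLN(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

state = WithoutLN().state_dict()  # checkpoint saved without the layer norm

try:
    WithLN().load_state_dict(state, strict=True)
except RuntimeError as e:
    print(e)  # "Error(s) in loading state_dict ... Missing key(s) ..."

# strict=False loads the overlapping keys and reports what is missing,
# but the model then runs with a freshly initialized layer norm, which
# would silently change the 350M model's outputs.
result = WithLN().load_state_dict(state, strict=False)
print(result.missing_keys)
```

This is why simply passing strict=False is not a real fix here: the checkpoint genuinely has no weights for the layer norm the current code constructs.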
Expected behavior
metaseq-api-local up & running
Environment
- metaseq Version: latest main (7828d72815a9a581ab47b95876d38cb262741883)
- PyTorch Version: 1.12.1+cu113
- OS (e.g., Linux): Ubuntu 18.04.6 LTS
- How you installed metaseq: pip
- Python version: 3.10.4
- CUDA/cuDNN version: CUDA 11.8
- GPU models and configuration: 1 x T4
Ok, did a bit of digging with @suchenzang; here is the summary:
The setattr(cfg["model"], "inference", True) line from https://github.com/facebookresearch/metaseq/commit/493e6017c18f7c2d3cd697693e6f9e33592f3612 is a bug; figuring out the best way to fix it and putting out a fix.
After commenting out that line as suggested, the second error (the self.layer_norm = None code path) is caused by this commit in particular: https://github.com/facebookresearch/metaseq/commit/c4b33ba6e2cd9b33539bbb5a35d831096bde3282
Suggested actions: 1/ Put up a fix for the first problem 2/ Keep code related to the second issue as is, and instead retrain the 350M model with layer norms 3/ Merge code paths with and without model parallel to avoid similar problems in the future