zero_to_fp32.py still loads a wrong model after the fix
Same as #1165; that issue is closed now, so I'm opening a new one. It looks like zero_to_fp32.py still cannot load the correct weights under a multi-GPU, ZeRO-2 setting.
Loading 'mp_rank_00_model_states.pt':
states['module']['deberta.encoder.layer.0.output.dense.weight']
tensor([[-0.0211, 0.0068, 0.0206, …, 0.0057, 0.0316, 0.0256],
[ 0.0273, 0.0141, 0.0118, …, -0.0122, 0.0054, 0.0010],
[ 0.0479, -0.0237, -0.0604, …, -0.0340, -0.0183, 0.0691],
…,
[ 0.0270, -0.0231, 0.0218, …, 0.0563, 0.0641, -0.0094],
[-0.0563, -0.0837, -0.0427, …, 0.0242, -0.0132, -0.0512],
[-0.0012, 0.0064, 0.0465, …, 0.0219, 0.0259, -0.0281]],
device='cuda:0', dtype=torch.float16)
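For reference, the shard above can be inspected with plain PyTorch. This is a minimal sketch, assuming the file sits in the current directory (the file name and key are taken from the dump above):

```python
import torch

# Load the rank-0 model-states shard written by DeepSpeed under ZeRO-2.
states = torch.load("mp_rank_00_model_states.pt", map_location="cpu")

# The (fp16) module state dict is stored under the 'module' key.
weight = states["module"]["deberta.encoder.layer.0.output.dense.weight"]
print(weight.dtype, weight.shape)
print(weight)
```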
Loading the exported weights using load_state_dict_from_zero_checkpoint:
(Pdb) self.deberta.encoder.layer[0].output.dense.weight
Parameter containing:
tensor([[ 0.0207, -0.0448, 0.0022, …, 0.0406, -0.0338, -0.0174],
[-0.0577, -0.0648, 0.0404, …, 0.0108, -0.0167, -0.0100],
[ 0.0548, 0.0063, 0.0024, …, 0.0311, 0.0249, 0.0167],
…,
[-0.0081, 0.0194, -0.0266, …, -0.0269, -0.0002, 0.0257],
[ 0.0202, -0.0002, 0.0831, …, -0.0008, -0.0094, 0.0258],
[-0.0320, 0.0529, -0.0259, …, 0.0117, -0.0292, -0.0064]],
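The second dump comes from DeepSpeed's checkpoint-conversion utility; a minimal sketch of that path, assuming a hypothetical checkpoint directory and an already-constructed model instance:

```python
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

# Consolidate the ZeRO-2 partitions into fp32 weights and load them into the model.
model = load_state_dict_from_zero_checkpoint(model, "path/to/checkpoint_dir")

# Inspect the same layer that was checked in the raw shard above.
print(model.deberta.encoder.layer[0].output.dense.weight)
```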
I'm using DeepSpeed 0.4.5. I can confirm the problem happens with 2 GPUs, but not with 1 GPU.
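One way to verify the mismatch without loading a model is to compare the raw shard against the reconstructed fp32 state dict. This is only a sketch under assumed paths (the key name comes from the dumps above):

```python
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

key = "deberta.encoder.layer.0.output.dense.weight"

# fp16 weight as stored in the rank-0 model-states shard.
shard = torch.load("path/to/checkpoint_dir/mp_rank_00_model_states.pt",
                   map_location="cpu")
w_shard = shard["module"][key].float()

# fp32 weight reconstructed from the ZeRO-2 partitions (the same logic zero_to_fp32.py uses).
fp32_sd = get_fp32_state_dict_from_zero_checkpoint("path/to/checkpoint_dir")
w_fp32 = fp32_sd[key]

# Per the report, this should print True with 1 GPU and False with 2 GPUs.
print(torch.allclose(w_shard, w_fp32, atol=1e-3))
```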
Issue Analytics
- Created: 2 years ago
- Comments: 29 (26 by maintainers)
Top GitHub Comments
@stas00 I'm sorry, I may have made a mistake. The -sv args don't print any traceback information, so at that time I didn't realize what caused the failure. Now I tried the -v args and printed the traceback message as follows. It seems related to some permission error rather than a bug in DeepSpeed.

@stas00 Sorry for the late response. I tried to tweak the model as below to reproduce the total model params of 50265, and ran the test as you told. It seemed that the test failed. Should I make a Pull Request?
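As a purely hypothetical illustration (not the commenter's actual code), a toy module whose parameters total exactly 50,265 could look like this:

```python
import torch

class TinyModel(torch.nn.Module):
    """Hypothetical toy model: 50264 weights + 1 bias = 50265 parameters."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(50264, 1)

    def forward(self, x):
        return self.linear(x)

total_params = sum(p.numel() for p in TinyModel().parameters())
print(total_params)  # 50265
```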