
zero_to_fp32.py still imports a wrong model after fix

See original GitHub issue

Same as #1165; that issue is now closed, so I'm opening a new one. It looks like zero_to_fp32.py still cannot load the correct weights under a multi-GPU, ZeRO-2 setting.

Loading 'mp_rank_00_model_states.pt':

states['module']['deberta.encoder.layer.0.output.dense.weight']
tensor([[-0.0211,  0.0068,  0.0206,  ...,  0.0057,  0.0316,  0.0256],
        [ 0.0273,  0.0141,  0.0118,  ..., -0.0122,  0.0054,  0.0010],
        [ 0.0479, -0.0237, -0.0604,  ..., -0.0340, -0.0183,  0.0691],
        ...,
        [ 0.0270, -0.0231,  0.0218,  ...,  0.0563,  0.0641, -0.0094],
        [-0.0563, -0.0837, -0.0427,  ...,  0.0242, -0.0132, -0.0512],
        [-0.0012,  0.0064,  0.0465,  ...,  0.0219,  0.0259, -0.0281]],
       device='cuda:0', dtype=torch.float16)

Loading the exported weights using load_state_dict_from_zero_checkpoint:

(Pdb) self.deberta.encoder.layer[0].output.dense.weight
Parameter containing:
tensor([[ 0.0207, -0.0448,  0.0022,  ...,  0.0406, -0.0338, -0.0174],
        [-0.0577, -0.0648,  0.0404,  ...,  0.0108, -0.0167, -0.0100],
        [ 0.0548,  0.0063,  0.0024,  ...,  0.0311,  0.0249,  0.0167],
        ...,
        [-0.0081,  0.0194, -0.0266,  ..., -0.0269, -0.0002,  0.0257],
        [ 0.0202, -0.0002,  0.0831,  ..., -0.0008, -0.0094,  0.0258],
        [-0.0320,  0.0529, -0.0259,  ...,  0.0117, -0.0292, -0.0064]],

I'm using DeepSpeed 0.4.5. I can confirm the problem happens on 2 GPUs, and not on 1 GPU.
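For reference, here is a minimal sketch of the kind of comparison shown above, assuming a standard DeepSpeed checkpoint layout. The checkpoint paths are placeholders, and the get_fp32_state_dict_from_zero_checkpoint helper (from the same zero_to_fp32 utilities) is used for illustration; neither is taken from the original report.

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "outputs/checkpoint"                                        # hypothetical checkpoint directory
shard_path = f"{ckpt_dir}/global_step1000/mp_rank_00_model_states.pt"  # hypothetical tag subdirectory

# Weight as stored in the rank-0 fp16 model-states shard
shard = torch.load(shard_path, map_location="cpu")
w_fp16 = shard["module"]["deberta.encoder.layer.0.output.dense.weight"]

# Same weight reconstructed from the partitioned ZeRO-2 fp32 optimizer states
fp32_sd = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)
w_fp32 = fp32_sd["deberta.encoder.layer.0.output.dense.weight"]

# If the conversion is correct, these should agree up to fp16 precision
print(torch.allclose(w_fp16.float(), w_fp32.float(), atol=1e-3))

On a correctly converted checkpoint this prints True; in the failing multi-GPU ZeRO-2 case described above it would not.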

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 29 (26 by maintainers)

Top GitHub Comments

1 reaction
skpig commented, Oct 1, 2021

@stas00 I’m sorry that I may have made a mistake.

  1. As I previously commented (https://github.com/microsoft/DeepSpeed/issues/1317#issuecomment-929753367), the test failed with my code that tweaks the model params. The test also failed with the clean code, as I mentioned here (https://github.com/microsoft/DeepSpeed/issues/1317#issuecomment-929783068).
  2. But the -sv flag doesn't print any traceback information, so at the time I didn't realize what caused the failure. Now I have tried the -v flag and captured the traceback below. It seems to be a permission error rather than a bug in DeepSpeed (see the workaround sketch after this list).
Traceback (most recent call last):                                                                                                                                                   
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap                                                                             
    self.run()                                                                                                                                                                       
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/multiprocessing/process.py", line 93, in run                                                                                     
    self._target(*self._args, **self._kwargs)                                                                                                                                        
  File "/home/huangbz/git_repo/DeepSpeed/tests/unit/common.py", line 53, in dist_init                                                                                                
    run_func(*func_args, **func_kwargs)                                                                                                                                              
  File "/home/huangbz/git_repo/DeepSpeed/tests/unit/test_zero.py", line 196, in _test_zero_to_fp32
    model_parameters=model.parameters())
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/__init__.py", line 141, in initialize
    config_params=config_params)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 220, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 860, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 942, in _configure_basic_optimizer
    adam_w_mode=effective_adam_w_mode)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 355, in load
    return self.jit_load(verbose)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 380, in jit_load
    os.makedirs(ext_path, exist_ok=True)
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/tmp/torch_extensions/fused_adam'
  3. I'm now testing the PR; I'll reply to you as soon as possible.
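For anyone who hits the same PermissionError on /tmp/torch_extensions: a minimal workaround sketch, assuming the shared directory simply isn't writable for this user. TORCH_EXTENSIONS_DIR is the environment variable PyTorch's JIT extension loader (which DeepSpeed's op builder relies on) consults for its build directory; the cache path below is just an example.

import os

# Redirect JIT extension builds to a user-writable directory instead of the
# shared /tmp/torch_extensions (the path here is an arbitrary example).
os.environ["TORCH_EXTENSIONS_DIR"] = os.path.expanduser("~/.cache/torch_extensions")

The same variable can of course be exported in the shell before launching pytest.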
1 reaction
skpig commented, Sep 29, 2021

@stas00 Sorry for the last response. I tried to tweak the model as below to reproduce the total model params of 50265.

@distributed_test(world_size=[2])
def _test_zero_to_fp32():
    class MyModel(torch.nn.Module):
        def __init__(self, hidden_dim, n_layers):
            super().__init__()
            # to reproduce https://github.com/microsoft/DeepSpeed/pull/1372 it is important that
            # the number of total elements is uneven:
            # (1) 4188 layers of 3*(3+1)=12 elements each, 50256 in total
            self.ll = torch.nn.ModuleList(
                torch.nn.Linear(hidden_dim,
                                hidden_dim) for i in range(n_layers))
            # (2) the following adds 8+1=9 elements
            self.classifier = torch.nn.Linear(8, 1)
            # total 50256 + 9 = 50265 (uneven as desired) elements
            self.cross_entropy_loss = torch.nn.CrossEntropyLoss()

        def forward(self, x, y):
            hidden = x
            for l in self.ll:
                hidden = l(hidden)
            return self.cross_entropy_loss(hidden, y)

    args = args_from_dict(tmpdir, config_dict)
    hidden_dim = 3  # do not change

    world_size = dist.get_world_size()
    # we want at least 2x layers as there are gpus to trigger round_robin_fp16_groups reshuffle in zero2
    n_layers = world_size * 2094 # total 4188 layers
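(A quick standalone sanity check, not part of the original test, that the element counts in the comments above add up to the intended uneven total; world_size = 2 is assumed.)

import torch

ll = torch.nn.ModuleList(torch.nn.Linear(3, 3) for _ in range(2 * 2094))  # 4188 layers, 3*(3+1)=12 elements each
classifier = torch.nn.Linear(8, 1)                                        # 8 weights + 1 bias = 9 elements
total = sum(p.numel() for p in ll.parameters()) + sum(p.numel() for p in classifier.parameters())
print(total)  # 4188*12 + 9 = 50265 -> uneven, as the repro requires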

I then ran the test as you suggested, and it failed:

pyt --forked tests/unit/test_zero.py -k test_zero_to_fp32[2] -sv

tests/unit/test_zero.py::test_zero_to_fp32[2] FAILED

===================================================================================== FAILURES ======================================================================================
_______________________________________________________________________________ test_zero_to_fp32[2] ________________________________________________________________________________
Worker 0 exited with code 1
============================================================================== short test summary info ==============================================================================
FAILED tests/unit/test_zero.py::test_zero_to_fp32[2]
========================================================================= 1 failed, 10 deselected in 35.30s =========================================================================

Should I make a Pull Request?
