
zero_to_fp32.py still imports a wrong model after fix

See original GitHub issue

Same as #1165; that issue is now closed, so I'm opening a new one. It looks like zero_to_fp32.py still cannot load the correct weights under a multi-GPU, ZeRO-2 setting.

Loading 'mp_rank_00_model_states.pt':

states['module']['deberta.encoder.layer.0.output.dense.weight']
tensor([[-0.0211,  0.0068,  0.0206,  ...,  0.0057,  0.0316,  0.0256],
        [ 0.0273,  0.0141,  0.0118,  ..., -0.0122,  0.0054,  0.0010],
        [ 0.0479, -0.0237, -0.0604,  ..., -0.0340, -0.0183,  0.0691],
        ...,
        [ 0.0270, -0.0231,  0.0218,  ...,  0.0563,  0.0641, -0.0094],
        [-0.0563, -0.0837, -0.0427,  ...,  0.0242, -0.0132, -0.0512],
        [-0.0012,  0.0064,  0.0465,  ...,  0.0219,  0.0259, -0.0281]],
       device='cuda:0', dtype=torch.float16)

Loading the exported weights using load_state_dict_from_zero_checkpoint:

(Pdb) self.deberta.encoder.layer[0].output.dense.weight
Parameter containing:
tensor([[ 0.0207, -0.0448,  0.0022,  ...,  0.0406, -0.0338, -0.0174],
        [-0.0577, -0.0648,  0.0404,  ...,  0.0108, -0.0167, -0.0100],
        [ 0.0548,  0.0063,  0.0024,  ...,  0.0311,  0.0249,  0.0167],
        ...,
        [-0.0081,  0.0194, -0.0266,  ..., -0.0269, -0.0002,  0.0257],
        [ 0.0202, -0.0002,  0.0831,  ..., -0.0008, -0.0094,  0.0258],
        [-0.0320,  0.0529, -0.0259,  ...,  0.0117, -0.0292, -0.0064]],

I'm using DeepSpeed 0.4.5. I can confirm the problem happens on 2 GPUs, and not on 1 GPU.
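For reference, here is a minimal sketch of the kind of comparison shown above, assuming a standard DeepSpeed checkpoint layout. The checkpoint paths are placeholders, and the get_fp32_state_dict_from_zero_checkpoint helper (from the same zero_to_fp32 utilities) is used for illustration; neither is taken from the original report.

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "outputs/checkpoint"                                        # hypothetical checkpoint directory
shard_path = f"{ckpt_dir}/global_step1000/mp_rank_00_model_states.pt"  # hypothetical tag subdirectory

# Weight as stored in the rank-0 fp16 model-states shard
shard = torch.load(shard_path, map_location="cpu")
w_fp16 = shard["module"]["deberta.encoder.layer.0.output.dense.weight"]

# Same weight reconstructed from the partitioned ZeRO-2 fp32 optimizer states
fp32_sd = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)
w_fp32 = fp32_sd["deberta.encoder.layer.0.output.dense.weight"]

# If the conversion is correct, these should agree up to fp16 precision
print(torch.allclose(w_fp16.float(), w_fp32.float(), atol=1e-3))

On a correctly converted checkpoint this prints True; in the failing multi-GPU ZeRO-2 case described above it would not.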

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 29 (26 by maintainers)

Top GitHub Comments

1 reaction
skpig commented, Oct 1, 2021

@stas00 I’m sorry that I may have made a mistake.

  1. As I previously commented (https://github.com/microsoft/DeepSpeed/issues/1317#issuecomment-929753367), the test failed with my code that tweaks the model params. The test also failed with the clean code, as I mentioned here (https://github.com/microsoft/DeepSpeed/issues/1317#issuecomment-929783068).
  2. But the -sv flag doesn't print any traceback information, so at the time I didn't realize what caused the failure. Now I have tried the -v flag and captured the traceback below. It seems to be a permission error rather than a bug in DeepSpeed (see the workaround sketch after this list).
Traceback (most recent call last):                                                                                                                                                   
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap                                                                             
    self.run()                                                                                                                                                                       
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/multiprocessing/process.py", line 93, in run                                                                                     
    self._target(*self._args, **self._kwargs)                                                                                                                                        
  File "/home/huangbz/git_repo/DeepSpeed/tests/unit/common.py", line 53, in dist_init                                                                                                
    run_func(*func_args, **func_kwargs)                                                                                                                                              
  File "/home/huangbz/git_repo/DeepSpeed/tests/unit/test_zero.py", line 196, in _test_zero_to_fp32
    model_parameters=model.parameters())
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/__init__.py", line 141, in initialize
    config_params=config_params)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 220, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 860, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 942, in _configure_basic_optimizer
    adam_w_mode=effective_adam_w_mode)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 355, in load
    return self.jit_load(verbose)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 380, in jit_load
    os.makedirs(ext_path, exist_ok=True)
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/tmp/torch_extensions/fused_adam'
  3. I'm now testing the PR; I'll reply to you as soon as possible.
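For anyone who hits the same PermissionError on /tmp/torch_extensions: a minimal workaround sketch, assuming the shared directory simply isn't writable for this user. TORCH_EXTENSIONS_DIR is the environment variable PyTorch's JIT extension loader (which DeepSpeed's op builder relies on) consults for its build directory; the cache path below is just an example.

import os

# Redirect JIT extension builds to a user-writable directory instead of the
# shared /tmp/torch_extensions (the path here is an arbitrary example).
os.environ["TORCH_EXTENSIONS_DIR"] = os.path.expanduser("~/.cache/torch_extensions")

The same variable can of course be exported in the shell before launching pytest.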
1 reaction
skpig commented, Sep 29, 2021

@stas00 Sorry for the last response. I tried to tweak the model as below to reproduce the total model params of 50265.

@distributed_test(world_size=[2])
def _test_zero_to_fp32():
    class MyModel(torch.nn.Module):
        def __init__(self, hidden_dim, n_layers):
            super().__init__()
            # to reproduce https://github.com/microsoft/DeepSpeed/pull/1372 it is important that
            # the number of total elements is uneven:
            # (1) 4188 layers of 3*(3+1)=12 elements each, 50256 in total
            self.ll = torch.nn.ModuleList(
                torch.nn.Linear(hidden_dim,
                                hidden_dim) for i in range(n_layers))
            # (2) the following adds 8+1=9 elements
            self.classifier = torch.nn.Linear(8, 1)
            # total 50256 + 9 = 50265 (uneven as desired) elements
            self.cross_entropy_loss = torch.nn.CrossEntropyLoss()

        def forward(self, x, y):
            hidden = x
            for l in self.ll:
                hidden = l(hidden)
            return self.cross_entropy_loss(hidden, y)

    args = args_from_dict(tmpdir, config_dict)
    hidden_dim = 3  # do not change

    world_size = dist.get_world_size()
    # we want at least 2x layers as there are gpus to trigger round_robin_fp16_groups reshuffle in zero2
    n_layers = world_size * 2094 # total 4188 layers
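(A quick standalone sanity check, not part of the original test, that the element counts in the comments above add up to the intended uneven total; world_size = 2 is assumed.)

import torch

ll = torch.nn.ModuleList(torch.nn.Linear(3, 3) for _ in range(2 * 2094))  # 4188 layers, 3*(3+1)=12 elements each
classifier = torch.nn.Linear(8, 1)                                        # 8 weights + 1 bias = 9 elements
total = sum(p.numel() for p in ll.parameters()) + sum(p.numel() for p in classifier.parameters())
print(total)  # 4188*12 + 9 = 50265 -> uneven, as the repro requires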

I then ran the test as you suggested, and it failed:

pyt --forked tests/unit/test_zero.py -k test_zero_to_fp32[2] -sv

tests/unit/test_zero.py::test_zero_to_fp32[2] FAILED

===================================================================================== FAILURES ======================================================================================
_______________________________________________________________________________ test_zero_to_fp32[2] ________________________________________________________________________________
Worker 0 exited with code 1
============================================================================== short test summary info ==============================================================================
FAILED tests/unit/test_zero.py::test_zero_to_fp32[2]
========================================================================= 1 failed, 10 deselected in 35.30s =========================================================================

Should I make a Pull Request?
