[BUG] stage 3 cannot load the checkpoint when optimizer is not configured

Describe the bug

DeepSpeed stage 3 cannot load a checkpoint during inference when no optimizer is configured.

Error: 'DeepSpeedZeRoOffload' object has no attribute 'checkpoint_event_prologue'

I used a very basic pytorch-lightning script to reproduce the issue. I am a maintainer at Lightning, so feel free to ping me for any information.

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        accelerator='gpu',
        devices=2,
        strategy='deepspeed_stage_3'
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.save_checkpoint('deepspeed_ckpt')
    trainer.test(model, dataloaders=test_data, ckpt_path='deepspeed_ckpt')


if __name__ == "__main__":
    run()

To Reproduce

Steps to reproduce the behavior:

1. Run the above script
2. deepspeed==0.7.4, torch==1.12.1+cu116
3. python script_name.py
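Because the report depends on specific package versions, it can help to confirm what is actually installed before running the script. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the list of package names passed to it are assumptions for illustration, not part of the original report.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in the current environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Package names are assumed distribution names; adjust to your environment.
    for name, ver in installed_versions(["deepspeed", "torch", "pytorch-lightning"]).items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Comparing the printed versions against the ones listed in the steps above is a quick way to rule out an environment mismatch.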

Expected behavior

It should work without any error, since the engine is reinitialized for inference without any optimizer, as mentioned here: https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html#deepspeed.DeepSpeedEngine.load_checkpoint
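The error message suggests a general failure pattern: code calls a checkpoint hook on whatever object sits in the optimizer slot, and the object used when no optimizer is configured does not define that hook. The sketch below reproduces that pattern with hypothetical stand-in classes. `FullOptimizer`, `OffloadOnly`, and both `load_checkpoint_*` functions are illustrative assumptions and are not DeepSpeed's actual classes or implementation; only the attribute name `checkpoint_event_prologue` is taken from the error above.

```python
# Hypothetical stand-ins; these are NOT DeepSpeed's real classes.
class FullOptimizer:
    """Stands in for an optimizer wrapper that defines the checkpoint hook."""

    def __init__(self):
        self.events = []

    def checkpoint_event_prologue(self):
        self.events.append("prologue")


class OffloadOnly:
    """Stands in for the object used when no optimizer is configured.

    It deliberately defines no checkpoint hook.
    """


def load_checkpoint_unguarded(optimizer_like):
    # Assumes the hook always exists; fails for OffloadOnly.
    optimizer_like.checkpoint_event_prologue()
    return "loaded"


def load_checkpoint_guarded(optimizer_like):
    # Calls the hook only when the object actually provides it.
    prologue = getattr(optimizer_like, "checkpoint_event_prologue", None)
    if callable(prologue):
        prologue()
    return "loaded"


if __name__ == "__main__":
    print(load_checkpoint_guarded(OffloadOnly()))
    try:
        load_checkpoint_unguarded(OffloadOnly())
    except AttributeError as err:
        print(f"AttributeError: {err}")
```

Whether the real fix guards the call or adds the missing hook to the other class is not stated in this thread; the sketch only illustrates why the call fails when no optimizer is present.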

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/rohit/miniconda3/envs/pl/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1+cu116
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed install path ........... ['/home/rohit/miniconda3/envs/pl/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.4, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6

Top GitHub Comments

ShijieZZZZ commented, Nov 14, 2022

Observed reported issue with 1.8.0. Investigating.

jeffra commented, Dec 2, 2022

@rohitgr7, please re-open if you’re still having issues here.
