
Checkpoint loading not working correctly

See original GitHub issue

We are having a critical issue in gpt-neox where checkpoint loading from DeepSpeed isn't working correctly. When we reload the model from a checkpoint, the loss suggests nothing was loaded at all and training is simply starting again from initialization… (the loss jumps from ~4 at checkpoint save to ~8 at checkpoint load)
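A framework-agnostic way to confirm this suspicion, independent of the loss curve, is to fingerprint the weights just before saving and just after loading: if the hashes differ, the load really did fall back to fresh initialization. The sketch below is illustrative, not from the issue's code — `state_fingerprint` and the toy dicts are stand-ins; on the real model you would hash the tensors in `model.state_dict()` instead:

```python
import hashlib
import json

def state_fingerprint(state):
    """Hash a name -> values mapping so two model states compare cheaply."""
    blob = json.dumps({k: list(v) for k, v in sorted(state.items())}).encode()
    return hashlib.sha256(blob).hexdigest()

# Toy stand-ins for weights captured before save and after reload
saved        = {"layer1.weight": [0.1, 0.2], "layer1.bias": [0.0]}
reloaded_ok  = {"layer1.weight": [0.1, 0.2], "layer1.bias": [0.0]}
reloaded_bad = {"layer1.weight": [0.7, -0.3], "layer1.bias": [0.0]}  # fresh init

print(state_fingerprint(saved) == state_fingerprint(reloaded_ok))   # True
print(state_fingerprint(saved) == state_fingerprint(reloaded_bad))  # False
```

Matching fingerprints across save and load would rule out the weights themselves and point at something else (e.g. optimizer or scheduler state).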

The model is based on the Megatron example in DeepSpeedExamples, and as far as I'm aware the checkpoint saving / loading hasn't changed at all from there (it should be handled almost entirely by model_engine.save_checkpoint / load_checkpoint).
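For reference, DeepSpeed's engine.load_checkpoint returns a (load_path, client_state) pair, and to the best of my knowledge load_path is None when no checkpoint directory could be found — a silent failure mode worth ruling out here, since it produces exactly this "training restarts from scratch" symptom. A toy stand-in for that contract (the `load_checkpoint` below is a hypothetical simplification, not DeepSpeed's real implementation):

```python
import os

def load_checkpoint(load_dir, tag="latest"):
    """Toy model of the (load_path, client_state) return contract."""
    path = os.path.join(load_dir, tag)
    if not os.path.isdir(path):
        return None, None  # nothing loaded -- and no exception raised
    return path, {"step": 500}

load_path, client_state = load_checkpoint("/nonexistent/checkpoints")
if load_path is None:
    print("checkpoint load silently failed; training restarts from init")
```

Checking load_path for None after every resume makes this failure loud instead of silent.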

It happens both with the pipeline parallel module, and without.

You can see here an example of training loss before saving a checkpoint:

[Screenshot from 2021-04-20 13-28-27: training loss curve before the checkpoint save]

and then the loss directly afterward:

[Screenshot from 2021-04-20 13-30-55: training loss curve after reloading the checkpoint]

Would appreciate any help to decipher what’s going on here.

We're technically using our DeepSpeed fork, but there are no modifications to the checkpoint saving / loading logic there either.

This is about as minimal a reproduction as I can give right now:

First, set up the environment in bash (this assumes you're running on an AWS machine, but it should work on any base image with torch + CUDA 11.1; just comment out the first line):

# first activate the correct AWS env with torch 1.8.0 + cuda 11.1
source activate pytorch_latest_p37

# clone repo and cd into it
git clone https://github.com/EleutherAI/gpt-neox
cd gpt-neox

# install other requirements
pip install pybind11==2.6.2 six regex numpy==1.20.1 nltk==3.5 \
        -e git+git://github.com/EleutherAI/DeeperSpeed.git@0e9573776caf7227ecafdc8a3d57c7955165d8e2#egg=deepspeed \
        zstandard==0.15.1 cupy-cuda111 mpi4py==3.0.3 wandb==0.10.21 einops==0.3.0 transformers tokenizers lm_dataformat triton==1.0.0.dev20210329

# install apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex.git@e2083df5eb96643c61613b9df48dd4eea6b07690

# prepare dummy gpt2 data
python prepare_data.py

Then, in a Python script, train a small model and try to continue training:

from yaml import load, dump
try:
    # use the faster C-backed loader/dumper when available
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
    from yaml import Loader, Dumper
import os

# read small.yml, set train-iters to 500, and write the result to tmp.yml
with open('configs/small.yml', 'r') as f:
    data = load(f, Loader=Loader)
data.update({'train-iters': 500})
with open('configs/tmp.yml', 'w') as f:
    f.write(dump(data))

# start run
os.system('./deepy.py pretrain_gpt2.py -d configs tmp.yml local_setup.yml')

# kill any hanging processes
os.system("pkill -f deepy.py")

# bump train-iters to 550 so the second run should resume from the checkpoint
with open('configs/small.yml', 'r') as f:
    data = load(f, Loader=Loader)
data.update({'train-iters': 550})
with open('configs/tmp.yml', 'w') as f:
    f.write(dump(data))

# start run again
os.system('./deepy.py pretrain_gpt2.py -d configs tmp.yml local_setup.yml')
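As an aside, os.system discards the command's exit status, so a failed first run would go unnoticed and the second run would "resume" from nothing. A small variant that fails loudly instead (`run` is an illustrative helper, not part of the repo):

```python
import subprocess

def run(cmd):
    """Run a shell command, raising CalledProcessError on non-zero exit."""
    return subprocess.run(cmd, shell=True, check=True)

run("true")   # exits 0, returns normally
try:
    run("false")  # non-zero exit raises, unlike os.system
except subprocess.CalledProcessError as e:
    print("command failed with exit code", e.returncode)
```

In the reproduction script above, calling run('./deepy.py …') in place of each os.system line would surface a crashed training run immediately.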

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
sdtblck commented, Apr 25, 2021

Hi @tjruwase - I found the bug.

It actually had nothing to do with DeepSpeed, but it does also affect the Megatron examples in DeepSpeedExamples, so you might want to fix it there too.

Fix described here -> https://github.com/EleutherAI/gpt-neox/pull/251

0 reactions
tjruwase commented, Apr 26, 2021

@sdtblck, thanks for the hard work that went into resolving this. Can you please close this issue?

Read more comments on GitHub >

