
restoring optimizer states (with DeepSpeed plugin used)

See original GitHub issue

Accelerate is a great library! Thanks for the amazing work!

I was able to save the optimizer/scheduler states using the Accelerate library, but when restoring them I got a CUDA out of memory error, so I suspect the optimizer states are not being saved or restored properly. I can load the states without error by using ckpt_states = torch.load(state_path, map_location='cpu'), but I'm not sure whether that is correct.
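Concretely, the workaround looks roughly like this (a minimal sketch; it assumes the optimizer is already constructed and state_path points at the file where the optimizer/scheduler states were torch.save'd, as in the saving function further down):

import torch

# By default torch.load puts every tensor back on the GPU it was saved from,
# so each process tries to materialize the full optimizer state on its own device.
# Mapping to CPU first avoids that allocation spike.
ckpt_states = torch.load(state_path, map_location="cpu")

# Optimizer.load_state_dict then moves the state tensors onto the devices of the
# corresponding parameters, so loading via CPU should be safe for a plain
# (non-ZeRO-partitioned) optimizer.
optimizer.load_state_dict(ckpt_states["optimizer"])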

Could you provide some tips or suggestions? (I'm implementing a feature that fully restores training, but ran into this problem.) Thanks.

I guess that saving optimizer states with DeepSpeed works differently. I saw that the HF Trainer does this, this, and this, but I'm not sure how to adapt that code for my project.
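From what I can tell, DeepSpeed wants its engine to do the checkpointing itself, roughly like this (a sketch based on the DeepSpeed docs rather than my actual code; model, scheduler, ds_config, save_dir and step are placeholders):

import deepspeed

# deepspeed.initialize wraps the model in an engine that owns the (possibly
# ZeRO-partitioned) optimizer and scheduler states
engine, optimizer, _, scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    lr_scheduler=scheduler,
    config=ds_config,
)

# save_checkpoint/load_checkpoint are collective calls: every rank must run them,
# and with ZeRO each rank writes/reads its own shard of the optimizer state under save_dir
engine.save_checkpoint(save_dir, tag="latest", client_state={"step": step})
_, client_state = engine.load_checkpoint(save_dir, tag="latest")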

My checkpoint saving function is below:

# (imports for the snippet)
import random
from pathlib import Path

import numpy as np
import torch


def save_ckpt(cfg, accelerator, model, optimizer, scheduler, epoch, step, score):
    accelerator.wait_for_everyone()
    ckpt_save_dir = Path(cfg.train.ckpt_save_dir)
    ckpt_file = ckpt_save_dir / "checkpoint.txt"
    ckpt_str = f"epoch-{epoch};step-{step};score-{score:.5f}"
    current_save_dir = ckpt_save_dir / ckpt_str
    if accelerator.is_local_main_process:
        current_save_dir.mkdir(parents=True, exist_ok=True)
        ckpt_file.write_text(ckpt_str)

    # save model
    model_path = current_save_dir / "model"
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        model_path,
        save_function=accelerator.save,
        state_dict=accelerator.get_state_dict(model),
    )

    # save optimizer, scheduler, scaler, epoch, step
    state_path = current_save_dir / "ckpt_states.pth"
    ckpt_states = {
        "scaler": accelerator.scaler.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epochs": epoch,
        "steps": step,
    }
    accelerator.save(ckpt_states, state_path)

    # save rng states
    rng_states = {
        "python": random.getstate(),
        "numpy": np.random.get_state(),
        "cpu": torch.random.get_rng_state(),
    }
    local_rank = accelerator.local_process_index
    if torch.cuda.is_available():
        # NB: the -1 convention below comes from the HF Trainer's args.local_rank;
        # accelerator.local_process_index is never -1, so only the else branch runs here
        if local_rank == -1:
            # In non distributed, we save the global CUDA RNG state (will take care of DataParallel)
            rng_states["cuda"] = torch.cuda.random.get_rng_state_all()
        else:
            rng_states["cuda"] = torch.cuda.random.get_rng_state()

    if local_rank == -1:
        torch.save(rng_states, current_save_dir / "rng_state.pth")
    else:
        torch.save(rng_states, current_save_dir / f"rng_state_{local_rank}.pth")
    return current_save_dir

My restore function looks like this:

    # configure optimizer
    optimizer = AdamW()  # (constructor args elided in this snippet)

    # config scheduler
    scheduler = get_scheduler(
        name=cfg.train.scheduler.name,
        optimizer=optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
    )

    if ckpt_states is not None:
        accelerator.scaler.load_state_dict(ckpt_states["scaler"])
        optimizer.load_state_dict(ckpt_states["optimizer"])  # <-- this is where the RuntimeError: CUDA out of memory happens!
        epoch_steps_trained = ckpt_states["steps"]
        epochs_trained = ckpt_states["epochs"]
        scheduler.load_state_dict(ckpt_states["scheduler"])

    if accelerator.is_local_main_process:
        logger.info(f"{num_training_steps=}, {num_warmup_steps=}, {epochs_trained=}, {epoch_steps_trained=}")

    model = accelerator.prepare_model(model)
    optimizer = accelerator.prepare_optimizer(optimizer)

    # other code that skips train_dataloader for epochs_trained and epoch_steps_trained
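That skipping logic is roughly the following (a placeholder sketch, assuming a HF-style model that returns a .loss; cfg.train.num_epochs is a stand-in for my config):

for epoch in range(epochs_trained, cfg.train.num_epochs):
    for step, batch in enumerate(train_dataloader):
        # fast-forward through batches that were already seen before the checkpoint
        if epoch == epochs_trained and step < epoch_steps_trained:
            continue
        outputs = model(**batch)
        accelerator.backward(outputs.loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()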

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
sgugger commented, Jan 31, 2022

Hi there! We’ll be working on adding a utility to help save/restore checkpoints in the coming month, so it should hopefully be easier to do this when it’s there 😃
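That utility later shipped as Accelerator.save_state / Accelerator.load_state. A minimal sketch of the flow in recent accelerate releases (the checkpoint path and the objects passed to prepare are placeholders):

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, scheduler
)

# saves the prepared model/optimizer/scheduler plus RNG states into one folder;
# with the DeepSpeed plugin, recent releases route this through DeepSpeed's own checkpointing
accelerator.save_state("ckpt/step-1000")

# ...and restores everything from that folder when resuming
accelerator.load_state("ckpt/step-1000")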

1 reaction
seanbenhur commented, Mar 7, 2022

Got it, thanks!

