
restoring optimizer states (with DeepSpeed plugin used)

See original GitHub issue

Accelerate is a great library! Thanks for the amazing work!

I was able to save the optimizer/scheduler states using the Accelerate library, but when restoring them I got a CUDA out of memory error, so I suspect the optimizer states are not being saved or restored properly. I can load the states without error by using ckpt_states = torch.load(state_path, map_location='cpu'), but I'm not sure whether that is correct.
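Concretely, the workaround looks roughly like this (a minimal sketch; it assumes the optimizer is already constructed and state_path points at the file where the optimizer/scheduler states were torch.save'd, as in the saving function further down):

import torch

# By default torch.load puts every tensor back on the GPU it was saved from,
# so each process tries to materialize the full optimizer state on its own device.
# Mapping to CPU first avoids that allocation spike.
ckpt_states = torch.load(state_path, map_location="cpu")

# Optimizer.load_state_dict then moves the state tensors onto the devices of the
# corresponding parameters, so loading via CPU should be safe for a plain
# (non-ZeRO-partitioned) optimizer.
optimizer.load_state_dict(ckpt_states["optimizer"])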

Could you provide some tips or suggestions? (I'm implementing a feature that fully restores training, but ran into this problem.) Thanks.

I guess that saving optimizer states with DeepSpeed works differently. I saw that the HF Trainer does this, this, and this, but I'm not sure how to adapt that code for my project.
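From what I can tell, DeepSpeed wants its engine to do the checkpointing itself, roughly like this (a sketch based on the DeepSpeed docs rather than my actual code; model, scheduler, ds_config, save_dir and step are placeholders):

import deepspeed

# deepspeed.initialize wraps the model in an engine that owns the (possibly
# ZeRO-partitioned) optimizer and scheduler states
engine, optimizer, _, scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    lr_scheduler=scheduler,
    config=ds_config,
)

# save_checkpoint/load_checkpoint are collective calls: every rank must run them,
# and with ZeRO each rank writes/reads its own shard of the optimizer state under save_dir
engine.save_checkpoint(save_dir, tag="latest", client_state={"step": step})
_, client_state = engine.load_checkpoint(save_dir, tag="latest")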

My checkpoint saving function is below:

# (imports for the snippet)
import random
from pathlib import Path

import numpy as np
import torch


def save_ckpt(cfg, accelerator, model, optimizer, scheduler, epoch, step, score):
    accelerator.wait_for_everyone()
    ckpt_save_dir = Path(cfg.train.ckpt_save_dir)
    ckpt_file = ckpt_save_dir / "checkpoint.txt"
    ckpt_str = f"epoch-{epoch};step-{step};score-{score:.5f}"
    current_save_dir = ckpt_save_dir / ckpt_str
    if accelerator.is_local_main_process:
        current_save_dir.mkdir(parents=True, exist_ok=True)
        ckpt_file.write_text(ckpt_str)

    # save model
    model_path = current_save_dir / "model"
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        model_path,
        save_function=accelerator.save,
        state_dict=accelerator.get_state_dict(model),
    )

    # save optimizer, scheduler, scaler, epoch, step
    state_path = current_save_dir / "ckpt_states.pth"
    ckpt_states = {
        "scaler": accelerator.scaler.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epochs": epoch,
        "steps": step,
    }
    accelerator.save(ckpt_states, state_path)

    # save rng states
    rng_states = {
        "python": random.getstate(),
        "numpy": np.random.get_state(),
        "cpu": torch.random.get_rng_state(),
    }
    local_rank = accelerator.local_process_index
    if torch.cuda.is_available():
        # NB: the -1 convention below comes from the HF Trainer's args.local_rank;
        # accelerator.local_process_index is never -1, so only the else branch runs here
        if local_rank == -1:
            # In non distributed, we save the global CUDA RNG state (will take care of DataParallel)
            rng_states["cuda"] = torch.cuda.random.get_rng_state_all()
        else:
            rng_states["cuda"] = torch.cuda.random.get_rng_state()

    if local_rank == -1:
        torch.save(rng_states, current_save_dir / "rng_state.pth")
    else:
        torch.save(rng_states, current_save_dir / f"rng_state_{local_rank}.pth")
    return current_save_dir

My restore function looks like this:

    # configure optimizer
    optimizer = AdamW()  # (constructor args elided in this snippet)

    # config scheduler
    scheduler = get_scheduler(
        name=cfg.train.scheduler.name,
        optimizer=optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
    )

    if ckpt_states is not None:
        accelerator.scaler.load_state_dict(ckpt_states["scaler"])
        optimizer.load_state_dict(ckpt_states["optimizer"])  # <-- this is where the RuntimeError: CUDA out of memory happens!
        epoch_steps_trained = ckpt_states["steps"]
        epochs_trained = ckpt_states["epochs"]
        scheduler.load_state_dict(ckpt_states["scheduler"])

    if accelerator.is_local_main_process:
        logger.info(f"{num_training_steps=}, {num_warmup_steps=}, {epochs_trained=}, {epoch_steps_trained=}")

    model = accelerator.prepare_model(model)
    optimizer = accelerator.prepare_optimizer(optimizer)

    # other code that skips train_dataloader for epochs_trained and epoch_steps_trained
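That skipping logic is roughly the following (a placeholder sketch, assuming a HF-style model that returns a .loss; cfg.train.num_epochs is a stand-in for my config):

for epoch in range(epochs_trained, cfg.train.num_epochs):
    for step, batch in enumerate(train_dataloader):
        # fast-forward through batches that were already seen before the checkpoint
        if epoch == epochs_trained and step < epoch_steps_trained:
            continue
        outputs = model(**batch)
        accelerator.backward(outputs.loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()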

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
sgugger commented, Jan 31, 2022

Hi there! We’ll be working on adding a utility to help save/restore checkpoints in the coming month, so it should hopefully be easier to do this when it’s there 😃
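That utility later shipped as Accelerator.save_state / Accelerator.load_state. A minimal sketch of the flow in recent accelerate releases (the checkpoint path and the objects passed to prepare are placeholders):

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, scheduler
)

# saves the prepared model/optimizer/scheduler plus RNG states into one folder;
# with the DeepSpeed plugin, recent releases route this through DeepSpeed's own checkpointing
accelerator.save_state("ckpt/step-1000")

# ...and restores everything from that folder when resuming
accelerator.load_state("ckpt/step-1000")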

1 reaction
seanbenhur commented, Mar 7, 2022

Got it, thanks!

