Restoring optimizer states (with the DeepSpeed plugin)
Accelerate is a great library! Thanks for the amazing work!
I was able to save the optimizer/scheduler states with the Accelerate library, but when restoring them I get a CUDA out of memory error, so I suspect the optimizer states are not being saved properly. I can restore the states without the error by loading with `ckpt_states = torch.load(state_path, map_location='cpu')`,
but I'm not sure whether that is correct.
Could you provide some tips or suggestions? (I'm implementing a feature that fully restores training, and ran into this problem.) Thanks!
I guess that saving optimizer states with DeepSpeed works differently; I saw that the HF Trainer does this, this, and this, but I'm not sure how to adapt that code to mine.
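For reference, the CPU-mapped load mentioned above looks like the minimal sketch below (`state_path` points at the `ckpt_states.pth` file written by the save function further down). PyTorch's `Optimizer.load_state_dict` casts the loaded state tensors to the device of the matching parameters, so mapping the file to CPU first avoids holding a second GPU copy of the optimizer state:

```python
import torch

# minimal sketch: map the checkpoint onto CPU, then let load_state_dict move
# each state tensor to the device of its corresponding parameter
ckpt_states = torch.load(state_path, map_location="cpu")
optimizer.load_state_dict(ckpt_states["optimizer"])
```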
My checkpoint-saving function is below:
```python
import random
from pathlib import Path

import numpy as np
import torch


def save_ckpt(cfg, accelerator, model, optimizer, scheduler, epoch, step, score):
    accelerator.wait_for_everyone()
    ckpt_save_dir = Path(cfg.train.ckpt_save_dir)
    ckpt_file = ckpt_save_dir / "checkpoint.txt"
    ckpt_str = f"epoch-{epoch};step-{step};score-{score:.5f}"
    current_save_dir = ckpt_save_dir / ckpt_str
    if accelerator.is_local_main_process:
        current_save_dir.mkdir(parents=True, exist_ok=True)
        ckpt_file.write_text(ckpt_str)

    # save model
    model_path = current_save_dir / "model"
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        model_path,
        save_function=accelerator.save,
        state_dict=accelerator.get_state_dict(model),
    )

    # save optimizer, scheduler, scaler, epoch, step
    state_path = current_save_dir / "ckpt_states.pth"
    ckpt_states = {
        "scaler": accelerator.scaler.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epochs": epoch,
        "steps": step,
    }
    accelerator.save(ckpt_states, state_path)

    # save rng states
    rng_states = {
        "python": random.getstate(),
        "numpy": np.random.get_state(),
        "cpu": torch.random.get_rng_state(),
    }
    local_rank = accelerator.local_process_index
    if torch.cuda.is_available():
        if local_rank == -1:
            # In non-distributed runs, save the global CUDA RNG state (will take care of DataParallel)
            rng_states["cuda"] = torch.cuda.random.get_rng_state_all()
        else:
            rng_states["cuda"] = torch.cuda.random.get_rng_state()
    if local_rank == -1:
        torch.save(rng_states, current_save_dir / "rng_state.pth")
    else:
        torch.save(rng_states, current_save_dir / f"rng_state_{local_rank}.pth")

    return current_save_dir
```
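For completeness, here is a hedged sketch of the counterpart that would restore the RNG states saved above so that shuffling and dropout resume deterministically. The helper name `load_rng_states` is hypothetical and not part of the original code:

```python
import random

import numpy as np
import torch


def load_rng_states(save_dir, accelerator):
    # hypothetical helper: restores the RNG states written by save_ckpt above
    local_rank = accelerator.local_process_index
    name = "rng_state.pth" if local_rank == -1 else f"rng_state_{local_rank}.pth"
    rng_states = torch.load(save_dir / name, map_location="cpu")
    random.setstate(rng_states["python"])
    np.random.set_state(rng_states["numpy"])
    torch.random.set_rng_state(rng_states["cpu"])
    if torch.cuda.is_available() and "cuda" in rng_states:
        if local_rank == -1:
            torch.cuda.random.set_rng_state_all(rng_states["cuda"])
        else:
            torch.cuda.random.set_rng_state(rng_states["cuda"])
```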
My restore function looks like this:
```python
# configure optimizer
optimizer = AdamW(...)

# configure scheduler
scheduler = get_scheduler(
    name=cfg.train.scheduler.name,
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

if ckpt_states is not None:
    accelerator.scaler.load_state_dict(ckpt_states["scaler"])
    optimizer.load_state_dict(ckpt_states["optimizer"])  # <-- RuntimeError: CUDA out of memory happens here!
    epoch_steps_trained = ckpt_states["steps"]
    epochs_trained = ckpt_states["epochs"]
    scheduler.load_state_dict(ckpt_states["scheduler"])
    if accelerator.is_local_main_process:
        logger.info(f"{num_training_steps=}, {num_warmup_steps=}, {epochs_trained=}, {epoch_steps_trained=}")

model = accelerator.prepare_model(model)
optimizer = accelerator.prepare_optimizer(optimizer)

# other code that skips train_dataloader for trained_epochs and trained_steps_in_epoch
```
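With ZeRO, DeepSpeed partitions the optimizer state across ranks and keeps it on GPU, so loading a full `optimizer.state_dict()` into an optimizer that has not yet been prepared is a plausible source of the OOM. One alternative, sketched below under the assumption that the model object returned by `accelerator.prepare` exposes DeepSpeed's `DeepSpeedEngine` checkpoint methods, is to let DeepSpeed write and read its own sharded checkpoints; the `tag` value and directory are illustrative:

```python
# hedged sketch: rely on DeepSpeed's own checkpointing for the sharded optimizer state
# (assumes the prepared model exposes deepspeed.DeepSpeedEngine's checkpoint API)
model, optimizer = accelerator.prepare(model, optimizer)

# at save time: every rank must call this, since each rank owns a shard of the state
model.save_checkpoint(str(current_save_dir), tag="resume")

# at resume time: again on every rank, after prepare()
load_path, client_state = model.load_checkpoint(str(current_save_dir), tag="resume")
```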
Top GitHub Comments
Hi there! We’ll be working on adding a utility to help save/restore checkpoints in the coming month, so it should hopefully be easier to do this when it’s there 😃
Got it, Thanks
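For anyone landing here later: the utility mentioned above shipped in later Accelerate releases as `Accelerator.save_state` / `Accelerator.load_state`, which checkpoint the prepared model, optimizer, GradScaler and RNG states (plus anything passed to `accelerator.register_for_checkpointing`). A minimal sketch, assuming a recent Accelerate version and an illustrative directory name:

```python
# minimal sketch of Accelerate's built-in checkpointing (directory name is illustrative)
model, optimizer = accelerator.prepare(model, optimizer)

accelerator.save_state("checkpoints/step_1000")   # writes model, optimizer, scaler and RNG states
# ... later, to resume, call this after prepare() on the same objects:
accelerator.load_state("checkpoints/step_1000")
```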