Why doesn't cpu_checkpointing work?
See original GitHub issue

I have partition_activations and cpu_checkpointing enabled, but the activations still appear to be on the GPU. I have just one GPU, so I can't do model parallelism. Does cpu_checkpointing only work with model parallelism? Why can't a single GPU (which should be equivalent to model parallelism of degree 1) offload all of its activation checkpoints to the CPU? My CPU memory is sufficient. Config:
{
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
        "contiguous_gradients": True,
    },
    "train_batch_size": 2,
    "fp16": {
        "enabled": True,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": True,
    },
    "wall_clock_breakdown": False,
}
Environment: Python 3.6, torch 1.6.0, DeepSpeed 0.3.7
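
For reference, a minimal sketch of how a config dict like the one above is typically handed to DeepSpeed on a single GPU. The names below (ds_config, model, optimizer) are placeholders rather than code from the original issue, and exact keyword names may vary slightly between DeepSpeed versions:

import deepspeed
import torch

# Abbreviated stand-in for the config dict shown above (placeholder name).
ds_config = {
    "train_batch_size": 2,
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True,
    },
}

# Placeholder model and optimizer for illustration only.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# config_params accepts the config as a Python dict in DeepSpeed 0.3.x.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params=ds_config,
)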
Issue Analytics
- State:
- Created 3 years ago
- Comments: 18 (11 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@hpourmodheji, that is helpful context. We did not enable activation checkpointing for BERT because models smaller than ~1B parameters may not benefit much, given the re-computation overhead that activation checkpointing introduces. However, if you want to enable it, do the following:
from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint
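
A minimal usage sketch of that import, assuming a forward pass whose layers are wrapped in activation checkpointing. The class, layer loop, and the "ds_config.json" path are placeholders, and the exact configure() wiring may differ between DeepSpeed versions:

import torch
import deepspeed
from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint

# Pick up partition_activations / cpu_checkpointing from the DeepSpeed config
# once, before the first forward pass. The first argument is the model-parallel
# unit; None here since this sketch assumes a single GPU.
deepspeed.checkpointing.configure(None, deepspeed_config="ds_config.json")

class Block(torch.nn.Module):
    def __init__(self, dim=1024, depth=4):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            torch.nn.Linear(dim, dim) for _ in range(depth)
        )

    def forward(self, hidden_states):
        for layer in self.layers:
            # Drop-in replacement for torch.utils.checkpoint.checkpoint: the
            # layer is re-run during backward, and with cpu_checkpointing
            # enabled DeepSpeed can keep the saved checkpoints in CPU memory.
            hidden_states = checkpoint(layer, hidden_states)
        return hidden_states

As the rest of the thread confirms, the activation_checkpointing settings in the config only take effect when the model's forward pass actually goes through DeepSpeed's checkpoint function rather than torch.utils.checkpoint.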
@tjruwase, Thank you so much for your help. I have also changed the following line: checkpoint.checkpoint(…) => checkpoint(…). It is working now. Thanks for your patience and help.