
Why doesn't cpu_checkpointing work?

See original GitHub issue

I have partition_activations and cpu_checkpointing enabled, but the activations still appear to stay on the GPU. I have only one GPU, so I can't use model parallelism. Does cpu_checkpointing only work with model parallelism? Why can't a single GPU (which is the same as 1-way model parallelism) offload all of its checkpoints to the CPU? My CPU memory is sufficient. Config:

{
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": true,
        "contiguous_gradients": true
    },
    "train_batch_size": 2,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "activation_checkpointing": {
        "partition_activations": true,
        "contiguous_memory_optimization": true,
        "cpu_checkpointing": true
    },
    "wall_clock_breakdown": false
}

Environment: Python 3.6, torch 1.6.0, DeepSpeed 0.3.7
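For background on what is being asked for here: activation checkpointing (which cpu_checkpointing extends by parking the saved activations in host memory) trades memory for recomputation — only some activations are stored, and the rest are rebuilt from the nearest stored one when the backward pass needs them. A minimal pure-Python sketch of that trade-off, independent of DeepSpeed (every name below is illustrative, not a DeepSpeed API):

```python
# Toy model: a chain of "layers", each a simple function of its input.
# forward_full keeps every activation; forward_checkpointed keeps only
# every k-th activation and recomputes the rest on demand.

def forward_full(x, layers):
    """Store every intermediate activation (no checkpointing)."""
    acts = [x]
    for f in layers:
        acts.append(f(acts[-1]))
    return acts  # len(layers) + 1 values held in memory

def forward_checkpointed(x, layers, k):
    """Store only every k-th activation; recompute the rest when asked."""
    stored = {0: x}
    cur = x
    for i, f in enumerate(layers, start=1):
        cur = f(cur)
        if i % k == 0:
            stored[i] = cur

    def activation(i, recompute_counter):
        # Rebuild activation i from the nearest stored ancestor.
        j = max(s for s in stored if s <= i)
        a = stored[j]
        for f in layers[j:i]:
            a = f(a)
            recompute_counter[0] += 1  # extra forward work paid for memory saved
        return a

    return cur, stored, activation

layers = [lambda v, m=m: v + m for m in range(1, 9)]  # 8 toy "layers"
full = forward_full(0, layers)
out, stored, activation = forward_checkpointed(0, layers, k=4)

counter = [0]
assert out == full[-1] == 36              # same result either way (1+2+...+8)
assert len(full) == 9                     # all activations kept
assert len(stored) == 3                   # only x, act_4, act_8 kept
assert activation(6, counter) == full[6]  # act_6 recomputed from act_4
assert counter[0] == 2                    # cost: 2 extra forward steps
```

This is why the maintainers note below that small models may not benefit: the memory saved (here, 9 stored values down to 3) is paid for with recomputation during backward.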

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 18 (11 by maintainers)

Top GitHub Comments

1 reaction
tjruwase commented, Mar 2, 2022

@hpourmodheji, that is helpful context. We did not enable activation checkpointing for BERT because models smaller than ~1B parameters may not benefit much, given the re-computation overhead that activation checkpointing introduces. However, if you want to enable it, do the following:

  1. Switch the flag here to True
  2. Replace this import with from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint

0 reactions
hpourmodheji commented, Mar 4, 2022

@tjruwase, Thank you so much for your help. I have also changed the following line: checkpoint.checkpoint(…) => checkpoint(…). It is working now. Thanks for your patience and help.
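The call-shape change hpourmodheji describes can be sketched as follows. All names here are stand-ins to illustrate the change from the thread; the real import, per tjruwase's step 2, is `from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint`:

```python
# Stand-in for DeepSpeed's activation-checkpointing entry point.
# The real function saves the layer's inputs (optionally in CPU memory when
# cpu_checkpointing is on) and re-runs fn during backward; this stub only
# mirrors the call signature so the example is self-contained.
def checkpoint(fn, *args):
    return fn(*args)

# Before the fix, the training code called an attribute of a module:
#     out = checkpoint.checkpoint(layer, x)
# After the fix, with the function imported directly, the call becomes:
def run_layer(layer, x):
    return checkpoint(layer, x)

print(run_layer(lambda v: v * 2, 21))  # prints 42
```

The point of the change is that once the function itself is imported (rather than a module of the same name), the extra attribute access `checkpoint.checkpoint(...)` no longer exists and the direct call is the correct form.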
