Why doesn't cpu_checkpointing work?
See original GitHub issue

I have partition_activations and cpu_checkpointing enabled, but the activations still appear to be on the GPU. I have just one GPU, so I can't do model parallelism. Does cpu_checkpointing only work with model parallelism? Why can't a single GPU (which should be equivalent to model parallelism of degree 1) offload all of its activation checkpoints to the CPU? My CPU memory is sufficient. Config:
{
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
        "contiguous_gradients": True,
    },
    "train_batch_size": 2,
    "fp16": {
        "enabled": True,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": True,
    },
    "wall_clock_breakdown": False,
}
Environment: Python 3.6, torch 1.6.0, DeepSpeed 0.3.7
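
For reference, a minimal sketch of how a config dict like the one above is typically handed to DeepSpeed on a single GPU. The names below (ds_config, model, optimizer) are placeholders rather than code from the original issue, and exact keyword names may vary slightly between DeepSpeed versions:

import deepspeed
import torch

# Abbreviated stand-in for the config dict shown above (placeholder name).
ds_config = {
    "train_batch_size": 2,
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True,
    },
}

# Placeholder model and optimizer for illustration only.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# config_params accepts the config as a Python dict in DeepSpeed 0.3.x.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params=ds_config,
)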
Issue Analytics
- State:
- Created 3 years ago
- Comments: 18 (11 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@hpourmodheji, that is helpful context. We did not enable activation checkpointing for BERT because models smaller than ~1B parameters may not benefit much, given the re-computation overhead that activation checkpointing introduces. However, if you want to enable it, do the following:
from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint
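
A minimal usage sketch of that import, assuming a forward pass whose layers are wrapped in activation checkpointing. The class, layer loop, and the "ds_config.json" path are placeholders, and the exact configure() wiring may differ between DeepSpeed versions:

import torch
import deepspeed
from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint

# Pick up partition_activations / cpu_checkpointing from the DeepSpeed config
# once, before the first forward pass. The first argument is the model-parallel
# unit; None here since this sketch assumes a single GPU.
deepspeed.checkpointing.configure(None, deepspeed_config="ds_config.json")

class Block(torch.nn.Module):
    def __init__(self, dim=1024, depth=4):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            torch.nn.Linear(dim, dim) for _ in range(depth)
        )

    def forward(self, hidden_states):
        for layer in self.layers:
            # Drop-in replacement for torch.utils.checkpoint.checkpoint: the
            # layer is re-run during backward, and with cpu_checkpointing
            # enabled DeepSpeed can keep the saved checkpoints in CPU memory.
            hidden_states = checkpoint(layer, hidden_states)
        return hidden_states

As the rest of the thread confirms, the activation_checkpointing settings in the config only take effect when the model's forward pass actually goes through DeepSpeed's checkpoint function rather than torch.utils.checkpoint.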
@tjruwase, Thank you so much for your help. I have also changed the following line: checkpoint.checkpoint(…) => checkpoint(…). It is working now. Thanks for your patience and help.