Activation checkpointing mpu_ problem
I was trying to integrate ZeRO offload with activation checkpointing in order to train a model related to point clouds. This is how my config looks:
{
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu": 7,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 3,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "contiguous_gradients": true,
    "overlap_comm": false,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": true,
    "number_checkpoints": 2
  }
}
And this is how I am creating the model:
self.model, self.optimizer, _, self.scheduler = deepspeed.initialize(
    model=self.model,
    optimizer=self.optimizer,
    lr_scheduler=self.scheduler,
    model_parameters=self.model.parameters(),
    config='./modules/ds_config.json')
Also, I have made the needed changes to my model (a custom forward) to implement per-layer checkpointing, like in Megatron; the sketch after the configure call below shows roughly the pattern. However, at the beginning I should do this:
deepspeed.checkpointing.configure(mpu_, deepspeed_config='./modules/ds_config.json')
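A minimal sketch of that pattern, using deepspeed.checkpointing.checkpoint as a drop-in replacement for torch.utils.checkpoint.checkpoint (the module structure here is a placeholder, not my actual model):

import torch
import deepspeed

class CheckpointedBlocks(torch.nn.Module):
    # Placeholder module: a stack of layers whose activations are recomputed
    # in the backward pass instead of being stored.
    def __init__(self, layers):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            # Only the layer's inputs are kept; the layer is re-run during backward.
            x = deepspeed.checkpointing.checkpoint(layer, x)
        return x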
But I don't know what that mpu_ should do or what it is. My goal is training on one GPU, so I don't need parallelism and other stuff. Can you please give me some suggestions on where I can dig in order to get that mpu_?
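In case it helps, here is my current guess for the single-GPU case, assuming mpu_ can simply be None when there is no model parallelism (an assumption on my side, not something I have confirmed):

import deepspeed

# Guess: with one GPU there is no model-parallel unit to pass, so use None.
deepspeed.checkpointing.configure(None, deepspeed_config='./modules/ds_config.json')

# The "activation_checkpointing" section of the config file then supplies
# partition_activations, cpu_checkpointing, number_checkpoints, etc.
# As far as I understand, partition_activations only matters when activations
# are split across model-parallel ranks, so it may be a no-op on a single GPU.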
Top GitHub Comments
Wow. Just tried it. I hit some bugs that forced me to disable contiguous_memory_optimization; I'll create another ticket for that. But... a 5x batch size increase... Is this legal?? Thank you for this amazing tool. Really democratizing training for research projects and students!
So, I have tried:
- NN + checkpoints at the high level: the baseline.
- NN + checkpoints at the high level + checkpoints at the low level: no memory improvement, but slower than the baseline.
- NN + checkpoints at the low level: memory like the baseline and a bit slower than it, but much harder to implement for each submodule.
Conclusion: checkpoints at the highest level are the best option for my use case.
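To make the comparison concrete, the two granularities look roughly like this (a sketch; stages stands in for my top-level submodules, each assumed to be an nn.Sequential):

import deepspeed

def forward_high_level(stages, x):
    # "High level": one checkpoint per top-level stage (e.g. encoder, decoder).
    for stage in stages:
        x = deepspeed.checkpointing.checkpoint(stage, x)
    return x

def forward_low_level(stages, x):
    # "Low level": a checkpoint around every individual layer inside each stage.
    # In my runs this saved no extra memory over the high-level version (the stage
    # boundaries already discard most activations), ran a bit slower, and was much
    # harder to wire up for every submodule.
    for stage in stages:
        for layer in stage:  # assumes each stage is an nn.Sequential
            x = deepspeed.checkpointing.checkpoint(layer, x)
    return x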