
Activation checkpointing mpu_ problem

See original GitHub issue

I was trying to integrate ZeRO offload with activation checkpointing in order to train a model related to point clouds. This is how my config looks:

{
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 7,
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": 2e8,
        "allgather_bucket_size": 2e8,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6,
        "contiguous_gradients": true,
        "overlap_comm": false,
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        }
    },
    "activation_checkpointing": {
        "partition_activations": true,
        "contiguous_memory_optimization": true,
        "cpu_checkpointing": true,
        "number_checkpoints":2
    }
}

And this is how I am creating the model:

self.model, self.optimizer, _, self.scheduler = deepspeed.initialize(
    model=self.model,
    optimizer=self.optimizer,
    lr_scheduler=self.scheduler,
    model_parameters=self.model.parameters(),
    config='./modules/ds_config.json'
)

Also, I have made the needed changes (a custom forward) to my model in order to implement checkpointing for layers, like in Megatron. However, at the beginning I should do this: deepspeed.checkpointing.configure(mpu_, deepspeed_config='./modules/ds_config.json'). But I don't know what that mpu_ is or what it should do. My goal is training on one GPU, so I don't need parallelism and the other stuff. Can you please give me some suggestions on where I can dig in order to get that mpu_?
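
As a rough illustration, a minimal single-GPU sketch could look as follows. This assumes that mpu_ may simply be passed as None when no model/tensor parallelism is used, and that partition_activations is switched off in that case (partitioning only splits activations across model-parallel ranks); the PointBlock/Model classes are made-up stand-ins, not the actual point-cloud network:

import deepspeed
import torch

# Configure DeepSpeed activation checkpointing from the same JSON config.
# Assumption: with a single GPU there is no model-parallel group, so mpu_ is None
# and partition_activations is disabled here.
deepspeed.checkpointing.configure(
    mpu_=None,
    deepspeed_config='./modules/ds_config.json',
    partition_activations=False,
)

class PointBlock(torch.nn.Module):
    """Made-up stand-in for one block of the point-cloud model."""
    def __init__(self, dim=128):
        super().__init__()
        self.linear = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))

class Model(torch.nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.block1 = PointBlock(dim)
        self.block2 = PointBlock(dim)

    def forward(self, x):
        # Custom forward: recompute block1's activations during backward
        # instead of storing them.
        x = deepspeed.checkpointing.checkpoint(self.block1, x)
        return self.block2(x)

If partitioned activation checkpointing across ranks were actually needed (multi-GPU model parallelism), an object exposing the Megatron-style model-parallel helpers (rank, world size, group) would presumably have to be passed as mpu_ instead of None.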

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
Moldoteck commented, May 18, 2021

Wow. Just tried it. I hit some bugs that forced me to disable contiguous_memory_optimization; I’ll create another ticket for that. But… a 5x batch size increase… Is this legal?? Thank you for this amazing tool. It really democratizes training for research projects and students!

0 reactions
Moldoteck commented, May 19, 2021

So, I have tried:

  • NN + checkpoints at the high level: baseline
  • NN + checkpoints at the high level + checkpoints at the low level: no memory improvement, but slower than the baseline
  • NN + checkpoints at the low level: memory like the baseline, a bit slower than it, but much harder to implement for each submodule

Conclusion: checkpoints at the highest level are the best option for my use case.
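
For readers wondering what "high level" vs "low level" means here, a rough sketch using torch.utils.checkpoint (Block/SubModule are made-up names, not the actual model):

import torch
from torch.utils.checkpoint import checkpoint

class SubModule(torch.nn.Module):
    """Stand-in for one low-level layer."""
    def __init__(self, dim=128):
        super().__init__()
        self.linear = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))

class Block(torch.nn.Module):
    """Stand-in for a high-level block built from several submodules."""
    def __init__(self, dim=128, low_level_ckpt=False):
        super().__init__()
        self.layers = torch.nn.ModuleList(SubModule(dim) for _ in range(4))
        self.low_level_ckpt = low_level_ckpt

    def forward(self, x):
        for layer in self.layers:
            if self.low_level_ckpt:
                # Low level: checkpoint every submodule individually.
                x = checkpoint(layer, x)
            else:
                x = layer(x)
        return x

class Net(torch.nn.Module):
    def __init__(self, high_level_ckpt=True, low_level_ckpt=False):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            Block(low_level_ckpt=low_level_ckpt) for _ in range(3))
        self.high_level_ckpt = high_level_ckpt

    def forward(self, x):
        for block in self.blocks:
            if self.high_level_ckpt:
                # High level: checkpoint each whole block, so only the block
                # inputs are stored and everything inside is recomputed in backward.
                x = checkpoint(block, x)
            else:
                x = block(x)
        return x

This matches the conclusion above: wrapping whole blocks already discards the intermediate activations inside each block, so adding per-submodule checkpoints on top saves little extra memory while adding recompute and implementation overhead.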

Read more comments on GitHub >

Top Results From Across the Web

  • Enhanced Activation Checkpointing - FairScale Documentation: Activation checkpointing is a technique used to reduce GPU memory usage during training. This is done by avoiding the need to store intermediate...
  • [REQUEST] Activation Checkpoint Prefetch #1575 - GitHub: On A100 server pods, activation checkpointing in CPU performs worse because of synchronization (HtoD memcpy or all-gather when partitioned...
  • Activation Checkpointing - Amazon SageMaker: Activation checkpointing (or gradient checkpointing) is a technique to reduce memory usage by clearing activations of certain layers and recomputing them...
  • torch.utils.checkpoint — PyTorch 1.13 documentation: Checkpointing is implemented by rerunning a forward-pass segment for each checkpointed segment during backward. This can cause persistent states like the RNG...
  • Low-Memory Neural Network Training - arXiv: We use the checkpointing strategy checkpoint-residual-2*, which reduces the activation memory by approximately 5.8x and increases FLOPs by...
