Activation checkpointing mpu_ problem
I was trying to integrate ZeRO offload with activation checkpointing in order to train a model related to point clouds. This is how my config looks:
{
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu": 7,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 3,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "contiguous_gradients": true,
    "overlap_comm": false,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": true,
    "number_checkpoints": 2
  }
}
And this is how I am creating the model:
self.model, self.optimizer, _, self.scheduler = deepspeed.initialize(
    model=self.model,
    optimizer=self.optimizer,
    lr_scheduler=self.scheduler,
    model_parameters=self.model.parameters(),
    config='./modules/ds_config.json')
Also, I have made the needed changes to my model (a custom forward) to implement per-layer checkpointing, like in Megatron; the sketch after the configure call below shows roughly the pattern. However, at the beginning I should do this:
deepspeed.checkpointing.configure(mpu_, deepspeed_config='./modules/ds_config.json')
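A minimal sketch of that pattern, using deepspeed.checkpointing.checkpoint as a drop-in replacement for torch.utils.checkpoint.checkpoint (the module structure here is a placeholder, not my actual model):

import torch
import deepspeed

class CheckpointedBlocks(torch.nn.Module):
    # Placeholder module: a stack of layers whose activations are recomputed
    # in the backward pass instead of being stored.
    def __init__(self, layers):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            # Only the layer's inputs are kept; the layer is re-run during backward.
            x = deepspeed.checkpointing.checkpoint(layer, x)
        return x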
But I don't know what that mpu_ should do or what it is. My goal is training on one GPU, so I don't need parallelism and other stuff. Can you please give me some suggestions on where I can dig in order to get that mpu_?
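In case it helps, here is my current guess for the single-GPU case, assuming mpu_ can simply be None when there is no model parallelism (an assumption on my side, not something I have confirmed):

import deepspeed

# Guess: with one GPU there is no model-parallel unit to pass, so use None.
deepspeed.checkpointing.configure(None, deepspeed_config='./modules/ds_config.json')

# The "activation_checkpointing" section of the config file then supplies
# partition_activations, cpu_checkpointing, number_checkpoints, etc.
# As far as I understand, partition_activations only matters when activations
# are split across model-parallel ranks, so it may be a no-op on a single GPU.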
Top GitHub Comments
Wow. Just tried it. I hit some bugs that forced me to disable contiguous_memory_optimization; I'll create another ticket for that. But... a 5x batch size increase... Is this legal?? Thank you for this amazing tool. Really democratizing training for research projects and students!
So, I have tried:
- NN + checkpoints at the high level: the baseline.
- NN + checkpoints at the high level + checkpoints at the low level: no memory improvement, but slower than the baseline.
- NN + checkpoints at the low level: memory like the baseline and a bit slower than it, but much harder to implement for each submodule.
Conclusion: checkpoints at the highest level are the best option for my use case.
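To make the comparison concrete, the two granularities look roughly like this (a sketch; stages stands in for my top-level submodules, each assumed to be an nn.Sequential):

import deepspeed

def forward_high_level(stages, x):
    # "High level": one checkpoint per top-level stage (e.g. encoder, decoder).
    for stage in stages:
        x = deepspeed.checkpointing.checkpoint(stage, x)
    return x

def forward_low_level(stages, x):
    # "Low level": a checkpoint around every individual layer inside each stage.
    # In my runs this saved no extra memory over the high-level version (the stage
    # boundaries already discard most activations), ran a bit slower, and was much
    # harder to wire up for every submodule.
    for stage in stages:
        for layer in stage:  # assumes each stage is an nn.Sequential
            x = deepspeed.checkpointing.checkpoint(layer, x)
    return x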