Can activation_checkpointing offload to NVMe?
See original GitHub issue.
The config has a cpu_checkpointing option under activation_checkpointing; why can't the activation checkpoints be offloaded to NVMe as well?
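For context, a minimal sketch of the existing activation_checkpointing section is shown below, assuming the documented key names; the values are illustrative. Note that cpu_checkpointing targets CPU RAM only, and this section currently has no NVMe device option, which is what the issue asks about.

```python
# Minimal sketch of DeepSpeed's activation_checkpointing config section.
# cpu_checkpointing moves partitioned activation checkpoints to CPU memory;
# there is no "nvme" target here.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,        # illustrative value
    "activation_checkpointing": {
        "partition_activations": True,          # required for cpu_checkpointing
        "cpu_checkpointing": True,              # offload checkpoints to CPU RAM
        "contiguous_memory_optimization": True,
        "number_checkpoints": 4,                # illustrative; depends on model depth
        "synchronize_checkpoint_boundary": False,
        "profile": False,
    },
}
# In recent DeepSpeed versions this dict can be passed to
# deepspeed.initialize(model=..., config=ds_config) or saved as a JSON config file.
```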
Issue Analytics
- Created: 2 years ago
- Comments: 13 (13 by maintainers)
Top Results From Across the Web
- [BUG] NVMe Offload, error while fetching submodule ... - GitHub: Describe the bug I want to test ZeRO-Infinity NVMe offload for large ... /deepspeed/runtime/activation_checkpointing/checkpointing.py", ...
- DeepSpeed Integration - Hugging Face: You can choose to offload both optimizer states and params to NVMe, or just one of them or none. For example, if you... (a ZeRO-3 NVMe config sketch follows this list)
- Train 1 trillion+ parameter models - PyTorch Lightning: Do not wrap the entire model with activation checkpointing. ... Additionally, DeepSpeed supports offloading to NVMe drives for even larger models, ...
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale ...: A critical question of offloading to CPU and NVMe memory is whether their limited ... Model states and activation checkpoints can have varying...
- Activation Checkpointing - Amazon SageMaker: Activation checkpointing (or gradient checkpointing) is a technique to reduce memory usage by clearing activations of certain layers and recomputing them ...
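For comparison with the CPU-only activation option above, here is a hedged sketch of how ZeRO stage 3 (ZeRO-Infinity) offloads parameters and optimizer states to NVMe. The offload_param, offload_optimizer, and aio keys are documented DeepSpeed config options, but the nvme_path and the numeric values are placeholders, and none of these options cover activation checkpoints.

```python
# Sketch of a ZeRO stage-3 config that offloads parameters and optimizer
# states to NVMe (ZeRO-Infinity). Activation checkpoints are NOT covered by
# these options; "/local_nvme" is a placeholder path for a local NVMe mount.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,        # illustrative value
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
        },
    },
    # Asynchronous I/O tuning for NVMe reads/writes; values are placeholders.
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
        "thread_count": 1,
    },
}
```

Other settings a real run would need (fp16/bf16, optimizer, scheduler) are omitted for brevity.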
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@tjruwase Thanks. I'm looking forward to that. DeepSpeed is an incredible library; I never thought I could train a 2.6B-parameter model on a 2080Ti GPU, and fast enough too. I learned a lot from this library, its articles, and its papers. Thanks again.
Do you have a roadmap for the DeepSpeed project? For example, plans for new features, or a big refactor like the one Transformers (https://github.com/huggingface/transformers/) did (they modularized their single huge classes/files).
@ghosthamlet, I am glad that things are working now. However, I am going to keep this issue open to address the original request.