Can activation_checkpointing offload to NVMe?

There is a cpu_checkpointing option in the config. Why can’t activation checkpoints be offloaded to NVMe as well?
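For context, here is a minimal sketch of the kind of config the question refers to, written as a Python dict. It assumes ZeRO stage 3 with optimizer and parameter offload to NVMe, plus cpu_checkpointing for activations. The NVMe path and the batch size are placeholders, not values from the issue. As far as this thread indicates, the activation_checkpointing section has a CPU option but no NVMe option, which is exactly what is being asked for.

```python
import json

# Placeholder mount point for the NVMe drive (an assumption, not from the issue).
NVME_PATH = "/local_nvme"

ds_config = {
    # Placeholder batch size so the sketch looks like a complete config.
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        # Optimizer states and parameters can each be pointed at NVMe.
        "offload_optimizer": {"device": "nvme", "nvme_path": NVME_PATH},
        "offload_param": {"device": "nvme", "nvme_path": NVME_PATH},
    },
    "activation_checkpointing": {
        # cpu_checkpointing offloads partitioned activations to CPU memory,
        # so it is enabled together with partition_activations here.
        "partition_activations": True,
        "cpu_checkpointing": True,
    },
}

# Print the JSON form, which could be saved as a DeepSpeed config file.
print(json.dumps(ds_config, indent=2))
```

Note the asymmetry: the two offload sections under zero_optimization take a device, while the activation checkpointing section only has an on/off switch for CPU.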

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
ghosthamlet commented, Aug 21, 2021

@tjruwase Thanks. I’m looking forward to that. DeepSpeed is an incredible library. I never thought I could train a 2.6B-parameter model on a 2080Ti GPU, and at a fast enough speed. I learned a lot from this library and its articles and papers. Thanks again.

Do you have a roadmap for the DeepSpeed project? For example, plans for new features, or a big refactor like the one Transformers (https://github.com/huggingface/transformers/) did, where they modularized their single huge classes/files.

0 reactions
tjruwase commented, Aug 19, 2021

@ghosthamlet, I am glad that things are working now. However, I am going to keep this issue open to address the original request.

Top Results From Across the Web

[BUG] NVMe Offload, error while fetching submodule ... - GitHub
Describe the bug I want to test ZeRO-Infinity NVMe offload for large ... /deepspeed/runtime/activation_checkpointing/checkpointing.py", ...

DeepSpeed Integration - Hugging Face
You can choose to offload both optimizer states and params to NVMe, or just one of them or none. For example, if you...

Train 1 trillion+ parameter models - PyTorch Lightning
Do not wrap the entire model with activation checkpointing. ... Additionally, DeepSpeed supports offloading to NVMe drives for even larger models, ...

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale ...
A critical question of offloading to CPU and NVMe memory is whether their limited ... Model states and activation checkpoints can have varying...

Activation Checkpointing - Amazon SageMaker
Activation checkpointing (or gradient checkpointing) is a technique to reduce memory usage by clearing activations of certain layers and recomputing them ...
