
Memory blows up when training large models on all TPU cores

See original GitHub issue

🐛 Bug

I am training on 8 TPU cores but the memory blows up when the epoch ends.

To Reproduce

Try training a BERT-large model on 8 TPU cores.

Expected behavior

The second epoch should start.

Environment

Kaggle TPU

  • PyTorch Lightning Version (e.g., 1.5.0):
  • PyTorch Version (e.g., 1.10):
  • Python version (e.g., 3.9):
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source):
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

I will try using GPUs instead of TPUs.

cc @kaushikb11 @rohitgr7 @awaelchli @ananthsub @ninginthecloud

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
KrishPro commented, Mar 30, 2022

Actually, I know what causes this issue. Earlier I was using torch/xla directly. The memory blows up when it tries to save the model at the end of an epoch; if you remove the checkpointing code, it works fine.

The issue is in saving large models trained on TPU (multi-core).
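One common mitigation for this pattern (a sketch, not confirmed as the fix for this particular issue) is to copy the model's parameters to host CPU memory and write the checkpoint from a single process, instead of letting each of the 8 cores serialize its own device-resident replica. The `torch_xla` guard shown in the comment (`xm.is_master_ordinal()`) is an assumption about the multi-core setup; the CPU-offload helper itself is plain PyTorch:

```python
import torch
import torch.nn as nn


def cpu_state_dict(model: nn.Module) -> dict:
    """Detach every parameter/buffer and copy it to CPU so that
    serialization does not hold on to device (TPU/GPU) memory."""
    return {k: v.detach().cpu() for k, v in model.state_dict().items()}


def save_checkpoint(model: nn.Module, path: str) -> None:
    # In a multi-core torch_xla setup, guard the write so only one
    # core produces the file (assumed setup, requires torch_xla):
    #   import torch_xla.core.xla_model as xm
    #   if xm.is_master_ordinal():
    #       torch.save(cpu_state_dict(model), path)
    torch.save(cpu_state_dict(model), path)


if __name__ == "__main__":
    # Demonstrate on a tiny CPU model: the saved tensors live on CPU,
    # regardless of where the model was trained.
    model = nn.Linear(4, 2)
    save_checkpoint(model, "/tmp/ckpt.pt")
    restored = torch.load("/tmp/ckpt.pt")
    print(all(t.device.type == "cpu" for t in restored.values()))
```

`torch_xla` also ships `xm.save`, which performs a similar CPU transfer internally; whether PyTorch Lightning's checkpoint callback routed through it at the time of this issue is not established here.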

0 reactions
KrishPro commented, Apr 9, 2022

Any updates? Maybe I should close this issue and open a duplicate on torch/xla?


Top Results From Across the Web

Memory blowup with TPU Trainer in master #6873 - GitHub
Recent changes to the Trainer for TPU has resulted in memory blowup during training. On a machine with 208GB of RAM [sic], this...

Handling big models - Hugging Face
Sharded checkpoints. It's possible your model is so big that even a single copy won't fit in RAM. That doesn't mean it can't...

Feeding the Beast: The Data Loading Path for Deep Learning ...
Transferring tensors into the GPU memory (CPU). Using parallelism to achieve throughput. A large amount of I/O, medium-high latency per example, and strong ...

Running out of GPU memory with just 3 samples of ...
Hi, I'm training a model with model.fitDataset. The input dimensions are [480, 640, 3] with just 4 outputs of size [1, 4] and...

Train With Mixed Precision - NVIDIA Documentation Center
Lowering the required memory enables training of larger models or training ... NVIDIA GPUs offer up to 8x more half precision arithmetic ...
