
ResourceExhaustedError when using TPU

See original GitHub issue

I have a few notebooks on Colab Pro that use the TPU and worked perfectly a day ago, but now everything crashes with:

ResourceExhaustedError: 9 root error(s) found.
  (0) Resource exhausted: {{function_node __inference_train_function_453721}} Compilation failure: Ran out of memory in memory space hbm. Used 16.79G of 7.48G hbm. Exceeded hbm capacity by 9.31G.

Is there a changelog where I can see what changed in Colab, or is this something to do with the TPU infrastructure?

I can make it work by reducing the batch size, but it has to be cut roughly in half, which makes the models train at least 2x slower.
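For reference, a sketch of the arithmetic behind that tradeoff; the core count and batch sizes below are assumptions (a standard Colab TPU v2-8 and a hypothetical batch of 1024), not values from the issue:

```python
# A Colab TPU v2-8 has 8 cores; Keras under a TPU strategy splits the
# global batch evenly across them, so per-core activation memory in
# HBM scales with global_batch // num_replicas.
num_replicas = 8          # assumption: standard Colab TPU v2-8
global_batch = 1024       # hypothetical original batch size
per_core = global_batch // num_replicas  # examples held per core

# Halving the global batch halves the per-core batch (and roughly
# halves activation memory), at the cost of ~2x more steps per epoch.
assert (global_batch // 2) // num_replicas == per_core // 2
```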

I’ve noticed that TensorFlow started to show strange warnings:

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py:601: get_next_as_optional (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.

WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0340s vs `on_train_batch_end` time: 0.4056s). Check your callbacks.

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Reactions:7
  • Comments:22 (4 by maintainers)

Top GitHub Comments

7 reactions
craigcitro commented, Aug 5, 2020

For anyone looking to help get this fixed: making comments on the upstream issue is most helpful.

For anyone looking to use TF 2.2 with a TPU for now, this should get you unblocked:

!pip install tensorflow~=2.2.0 tensorflow_gcs_config~=2.2.0
import tensorflow as tf
import requests
import os

# Ask the Colab TPU runtime to serve the same TF version we just installed.
tpu_host = os.environ["COLAB_TPU_ADDR"].split(":")[0]
resp = requests.post("http://{}:8475/requestversion/{}".format(tpu_host, tf.__version__))
if resp.status_code != 200:
  print("Failed to switch the TPU to TF {}".format(tf.__version__))
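For clarity, a minimal sketch of how the request URL in the snippet above is assembled. The TPU address here is a stand-in value for illustration, not a real endpoint:

```python
import os

# COLAB_TPU_ADDR holds "host:port" (e.g. "10.0.0.2:8470"); the
# version-switch service listens on port 8475 of the same host.
os.environ.setdefault("COLAB_TPU_ADDR", "10.0.0.2:8470")  # stand-in value
tpu_host = os.environ["COLAB_TPU_ADDR"].split(":")[0]
url = "http://{}:8475/requestversion/{}".format(tpu_host, "2.2.0")
print(url)
```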
1 reaction
graf10a commented, Aug 9, 2020

@gena Great! I am glad it is working for you now!

Read more comments on GitHub >

Top Results From Across the Web

  • [TPU/GPU] Resource Exhausted Error, but I don't have a large ...
    For some reason, when trying to define my model using transfer learning, I always get a ResourceExhaustedError, no matter how low I try...
  • Use TPU in Google Colab - python - Stack Overflow
    You need to create a TPU strategy: strategy = tf.distribute.TPUStrategy(resolver). And then use this strategy properly:
  • Troubleshooting TensorFlow - TPU - Google Cloud
    ResourceExhaustedError: Ran out of memory in memory space hbm; used: YYY; limit: 7.48G. Frameworks and Configurations Affected.
  • "ResourceExhaustedError: received trailing metadata size ...
    Hi! This is my first time training with a TPU in Colab and I am facing an error I have never seen before....
  • Resource exhausted: OOM when allo... - Apple Developer
    ResourceExhaustedError: 2 root error(s) found. (0) Resource exhausted: OOM when allocating tensor with shape[256,384,3072] and type float on ...
