
ResourceExhaustedError when using TPU

See original GitHub issue

I have a few notebooks on Colab Pro that use the TPU and worked perfectly a day ago, but now everything crashes with:

ResourceExhaustedError: 9 root error(s) found.
  (0) Resource exhausted: {{function_node __inference_train_function_453721}} Compilation failure: Ran out of memory in memory space hbm. Used 16.79G of 7.48G hbm. Exceeded hbm capacity by 9.31G.

Is there a changelog where I can see what changed in Colab, or is this something to do with the TPU infrastructure?

I can make it work by reducing the batch size, but it has to be cut roughly in half, which makes the models train at least 2x slower.
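For reference, a sketch of the arithmetic behind that tradeoff; the core count and batch sizes below are assumptions (a standard Colab TPU v2-8 and a hypothetical batch of 1024), not values from the issue:

```python
# A Colab TPU v2-8 has 8 cores; Keras under a TPU strategy splits the
# global batch evenly across them, so per-core activation memory in
# HBM scales with global_batch // num_replicas.
num_replicas = 8          # assumption: standard Colab TPU v2-8
global_batch = 1024       # hypothetical original batch size
per_core = global_batch // num_replicas  # examples held per core

# Halving the global batch halves the per-core batch (and roughly
# halves activation memory), at the cost of ~2x more steps per epoch.
assert (global_batch // 2) // num_replicas == per_core // 2
```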

I’ve noticed that TensorFlow started to show strange warnings:

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py:601: get_next_as_optional (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.

WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0340s vs `on_train_batch_end` time: 0.4056s). Check your callbacks.

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Reactions:7
  • Comments:22 (4 by maintainers)

Top GitHub Comments

7 reactions
craigcitro commented, Aug 5, 2020

For anyone looking to help get this fixed: making comments on the upstream issue is most helpful.

For anyone looking to use TF 2.2 with a TPU for now, this should get you unblocked:

!pip install tensorflow~=2.2.0 tensorflow_gcs_config~=2.2.0
import tensorflow as tf
import requests
import os

# Ask the Colab TPU runtime to serve the same TF version we just installed.
tpu_host = os.environ["COLAB_TPU_ADDR"].split(":")[0]
resp = requests.post("http://{}:8475/requestversion/{}".format(tpu_host, tf.__version__))
if resp.status_code != 200:
  print("Failed to switch the TPU to TF {}".format(tf.__version__))
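For clarity, a minimal sketch of how the request URL in the snippet above is assembled. The TPU address here is a stand-in value for illustration, not a real endpoint:

```python
import os

# COLAB_TPU_ADDR holds "host:port" (e.g. "10.0.0.2:8470"); the
# version-switch service listens on port 8475 of the same host.
os.environ.setdefault("COLAB_TPU_ADDR", "10.0.0.2:8470")  # stand-in value
tpu_host = os.environ["COLAB_TPU_ADDR"].split(":")[0]
url = "http://{}:8475/requestversion/{}".format(tpu_host, "2.2.0")
print(url)
```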
1 reaction
graf10a commented, Aug 9, 2020

@gena Great! I am glad it is working for you now!

Read more comments on GitHub >

Top Results From Across the Web

  • [TPU/GPU] Resource Exhausted Error, but I don't have a large ...
    For some reason, when trying to define my model using transfer learning, I always get a ResourceExhaustedError, no matter how low I try...
  • Use TPU in Google Colab - python - Stack Overflow
    You need to create a TPU strategy: strategy = tf.distribute.TPUStrategy(resolver). And then use this strategy properly:
  • Troubleshooting TensorFlow - TPU - Google Cloud
    ResourceExhaustedError: Ran out of memory in memory space hbm; used: YYY; limit: 7.48G. Frameworks and Configurations Affected.
  • "ResourceExhaustedError: received trailing metadata size ...
    Hi! This is my first time training with a TPU in Colab and I am facing an error I have never seen before....
  • Resource exhausted: OOM when allo... - Apple Developer
    ResourceExhaustedError: 2 root error(s) found. (0) Resource exhausted: OOM when allocating tensor with shape[256,384,3072] and type float on ...
