Google colab tpu_driver: DEADLINE_EXCEEDED

This morning, this nerfies training Colab notebook was working. For the past couple of hours, however, executing this cell:

# @title Configure notebook runtime
# @markdown If you would like to use a GPU runtime instead, change the runtime type by going to `Runtime > Change runtime type`. 
# @markdown You will have to use a smaller batch size on GPU.

runtime_type = 'tpu'  # @param ['gpu', 'tpu']
if runtime_type == 'tpu':
  import jax.tools.colab_tpu
  jax.tools.colab_tpu.setup_tpu()

print('Detected Devices:', jax.devices())

now raises an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-2-4e527b212d00> in <module>()
      8   jax.tools.colab_tpu.setup_tpu()
      9 
---> 10 print('Detected Devices:', jax.devices())

2 frames
/usr/local/lib/python3.7/dist-packages/jax/_src/lib/xla_bridge.py in devices(backend)
    312     List of Device subclasses.
    313   """
--> 314   return get_backend(backend).devices()
    315 
    316 

/usr/local/lib/python3.7/dist-packages/jax/_src/lib/xla_bridge.py in get_backend(platform)
    256 @lru_cache(maxsize=None)  # don't use util.memoize because there is no X64 dependence.
    257 def get_backend(platform=None):
--> 258   return _get_backend_uncached(platform)
    259 
    260 

/usr/local/lib/python3.7/dist-packages/jax/_src/lib/xla_bridge.py in _get_backend_uncached(platform)
    246     if backend is None:
    247       if platform in _backends_errors:
--> 248         raise RuntimeError(f"Requested backend {platform}, but it failed "
    249                            f"to initialize: {_backends_errors[platform]}")
    250       raise RuntimeError(f"Unknown backend {platform}")

RuntimeError: Requested backend tpu_driver, but it failed to initialize: DEADLINE_EXCEEDED: Failed to connect to remote server at address: grpc://10.113.198.178:8470. Error from gRPC: Deadline Exceeded. Details: 
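
The message above says the gRPC client could not connect to the TPU worker before its deadline. One way to tell a network-level problem from a driver problem is to test plain TCP reachability of the address in the error. The following is a minimal sketch, not part of the original issue: the helper names `parse_grpc_target` and `can_connect` are hypothetical, and only the Python standard library is used.

```python
import socket
from urllib.parse import urlparse


def parse_grpc_target(target):
    """Split a target such as 'grpc://10.113.198.178:8470' into (host, port)."""
    parsed = urlparse(target)
    if parsed.scheme != "grpc" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"not a grpc://host:port target: {target!r}")
    return parsed.hostname, parsed.port


def can_connect(host, port, timeout_s=5.0):
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False
```

Calling `can_connect(*parse_grpc_target("grpc://10.113.198.178:8470"))` from the notebook would report whether the port accepts connections at all. A `False` result points toward a networking or provisioning problem, though a `True` result does not by itself prove the TPU driver is healthy.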

I tried changing the TPU driver, following the recommendation in https://github.com/google/jax/issues/4408:

import tensorflow as tf
from tensorflow.python.tpu.client.client import Client

c = Client()
c.configure_tpu_version("tpu_driver0.1-dev20200320", restart_type='ifNeeded')
c.wait_for_healthy()

to which the back end never responded:

WARNING:root:Waiting for TPU "grpc://10.36.75.242:8470" with state "None" and health "None" to become healthy
WARNING:root:Waiting for TPU "grpc://10.36.75.242:8470" with state "None" and health "None" to become healthy
WARNING:root:Waiting for TPU "grpc://10.36.75.242:8470" with state "None" and health "None" to become healthy
WARNING:root:Waiting for TPU "grpc://10.36.75.242:8470" with state "None" and health "None" to become healthy
WARNING:root:Waiting for TPU "grpc://10.36.75.242:8470" with state "None" and health "None" to become healthy

I did not change anything that could have broken the code between this morning and this afternoon. I made sure the runtime is using a TPU, I factory reset the instance, and I even tried with another Google account.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 8

Top GitHub Comments

jakevdp commented, Nov 8, 2021 (1 reaction)

#8485 adds the ability to specify a TPU driver version within setup_tpu().
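
To illustrate what selecting a driver version involves, here is a minimal sketch of the request that older releases of `jax.tools.colab_tpu` are understood to send to the Colab TPU host. The port `8475`, the `/requestversion/` path, and the `COLAB_TPU_ADDR` environment variable are assumptions based on that legacy helper, not details stated in this thread, and `driver_request_url` is a hypothetical name. The version string is the one tried earlier in the issue.

```python
import os


def driver_request_url(colab_tpu_addr, tpu_driver_version):
    """Build the URL assumed to select a TPU driver version on a Colab TPU host.

    colab_tpu_addr is a 'host:port' string, such as the value Colab is
    assumed to expose in the COLAB_TPU_ADDR environment variable.
    """
    host = colab_tpu_addr.split(":")[0]
    # Port 8475 and the /requestversion/ path are assumptions, not confirmed here.
    return f"http://{host}:8475/requestversion/{tpu_driver_version}"


# Fall back to the address from the error message when not running on Colab.
addr = os.environ.get("COLAB_TPU_ADDR", "10.113.198.178:8470")
url = driver_request_url(addr, "tpu_driver0.1-dev20200320")
print(url)
```

On a live runtime this URL would be sent as an HTTP POST before calling `jax.devices()`. With #8485, the driver version is passed to `setup_tpu()` instead, so building the request by hand should not be needed.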

Yuji-github commented, Aug 6, 2022 (0 reactions)

I ran into the same issue. In my case, I turned off my VPN (Surfshark) and re-ran the terminal, and then it worked. I'm not sure the VPN caused the errors, but it's worth considering.

jax.tools.colab_tpu.setup_tpu()
print("TPU: ", jax.devices())
