TPU not found on VM

Description

Hello

I’m running a TPU v3-8 VM on Google. On the VM I installed jax with pip install "jax[tpu]==0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html.

Unfortunately, calling jax.device_count() prints the message "No GPU/TPU found, falling back to CPU." The same happens with pip install jax==0.2.12. It only works when I use pip install "jax[tpu]>=0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html (which installs the newest jax version), but I need jax version 0.2.12 or 0.2.16.

How can I get it running with these versions?

What jax/jaxlib version are you using?

jax 0.2.16

Which accelerator(s) are you using?

TPU

Additional system info

No response

NVIDIA GPU info

No response

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Reactions: 2
  • Comments: 10 (1 by maintainers)

Top GitHub Comments

4 reactions
skye commented, Nov 22, 2022

That didn’t take as long as I expected 😃

For 0.2.16, you can work around this by setting the env var TPU_LIBRARY_PATH=/home/skyewm/.local/lib/python3.8/site-packages/libtpu/libtpu.so. (You may have to adjust that path depending on where libtpu-nightly was installed; running locate libtpu.so may be helpful.)

The underlying problem is that this version of jax still expected libtpu.so to be automatically installed in the VM image (https://github.com/google/jax/blob/jax-v0.2.16/jax/_src/cloud_tpu_init.py#L104), which the TPU VM base image no longer does.
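
As a rough illustration of the workaround above, the following Python sketch looks for libtpu.so and sets TPU_LIBRARY_PATH before jax is imported. It assumes that libtpu-nightly installs an importable package directory named libtpu containing libtpu.so (as the path in the comment above suggests). The helper names find_libtpu_so and configure_tpu_library_path and the extra_dirs parameter are hypothetical and are not part of jax. Whether the variable is honored may also depend on the jax/jaxlib version, so treat this as a starting point rather than a confirmed fix.

```python
import importlib.util
import os
from pathlib import Path
from typing import Iterable, Optional


def find_libtpu_so(extra_dirs: Iterable[str] = ()) -> Optional[str]:
    """Return the path to libtpu.so if one can be found, else None.

    Hypothetical helper: it first checks the directory of an installed
    `libtpu` package (assumed layout: <site-packages>/libtpu/libtpu.so),
    then any extra directories supplied by the caller.
    """
    candidates = []

    # find_spec locates a top-level package without importing it.
    spec = importlib.util.find_spec("libtpu")
    if spec is not None and spec.submodule_search_locations:
        for location in spec.submodule_search_locations:
            candidates.append(Path(location) / "libtpu.so")

    for directory in extra_dirs:
        candidates.append(Path(directory) / "libtpu.so")

    for candidate in candidates:
        if candidate.is_file():
            return str(candidate)
    return None


def configure_tpu_library_path(extra_dirs: Iterable[str] = ()) -> Optional[str]:
    """Set TPU_LIBRARY_PATH if libtpu.so is found; return the path used."""
    path = find_libtpu_so(extra_dirs)
    if path is not None:
        os.environ["TPU_LIBRARY_PATH"] = path
    return path


if __name__ == "__main__":
    chosen = configure_tpu_library_path()
    if chosen is None:
        print("libtpu.so not found; TPU_LIBRARY_PATH left unchanged")
    else:
        print("TPU_LIBRARY_PATH set to", chosen)
    # Import jax only after this point, so that the variable is already
    # set when jax initializes its TPU backend.
```

The environment variable has to be set before the first import of jax in the process. Exporting it in the shell before starting Python has the same effect.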

0 reactions
Eichhof commented, Dec 9, 2022

Please see below the content of tpu_driver.INFO.

I’m not able to upgrade jax because I want to use it to fine-tune GPT-J following this tutorial: https://github.com/kingoflolz/mesh-transformer-jax/blob/master/howto_finetune.md. The tutorial does not work with newer jax versions.

Log file created at: 2022/12/09 15:40:15
Running on machine: t1v-n-ee970b5a-w-0
Binary: Built on Jun 10 2021 11:50:32 (1623351002)
Binary: Built at cloud-tpus-runtime-release-tool@jgww9.prod.google.com:/google/src/cloud/buildrabbit-username/buildrabbit-client/g3
Binary: Built for gcc-4.X.Y-crosstool-v18-llvm-grtev4-k8
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1209 15:40:15.567064   13256 b295d63588a.cc:758] Linux version 5.13.0-1027-gcp (buildd@lcy02-amd64-062) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #32~20.04.1-Ubuntu SMP Thu May 26 10:53:08 UTC 2022
I1209 15:40:15.567366   13256 b295d63588a.cc:825] Process id 13256
I1209 15:40:15.567378   13256 b295d63588a.cc:830] Current working directory /home/myUsername
I1209 15:40:15.567379   13256 b295d63588a.cc:832] Current timezone is UTC (currently UTC +00:00)
I1209 15:40:15.567382   13256 b295d63588a.cc:836] Built on Jun 10 2021 11:50:32 (1623351002)
I1209 15:40:15.567382   13256 b295d63588a.cc:837]  at cloud-tpus-runtime-release-tool@jgww9.prod.google.com:/google/src/cloud/buildrabbit-username/buildrabbit-client/g3
I1209 15:40:15.567383   13256 b295d63588a.cc:838]  as //learning/45eac/tfrc/executor:_libtpu.so
I1209 15:40:15.567384   13256 b295d63588a.cc:839]  for gcc-4.X.Y-crosstool-v18-llvm-grtev4-k8
I1209 15:40:15.567385   13256 b295d63588a.cc:842]  from changelist 378699432 with baseline 378699432 in a mint client based on __ar56t/g3
I1209 15:40:15.567386   13256 b295d63588a.cc:846] Build label: libtpu_runtime_20210610_RC00
I1209 15:40:15.567387   13256 b295d63588a.cc:848] Build tool: Bazel, release r4rca-2021.06.04-6 (mainline @377391976)
I1209 15:40:15.567388   13256 b295d63588a.cc:849] Build target:
I1209 15:40:15.567389   13256 b295d63588a.cc:861] Command line arguments:
I1209 15:40:15.567390   13256 b295d63588a.cc:863] argv[0]: './tpu_driver'
I1209 15:40:15.567393   13256 b295d63588a.cc:863] argv[1]: '--minloglevel=0'
I1209 15:40:15.567394   13256 b295d63588a.cc:863] argv[2]: '--stderrthreshold=3'
I1209 15:40:15.567395   13256 b295d63588a.cc:863] argv[3]: '--v=0'
I1209 15:40:15.567396   13256 b295d63588a.cc:863] argv[4]: '--vmodule='
I1209 15:40:15.567397   13256 b295d63588a.cc:863] argv[5]: '--log_dir=/tmp/tpu_logs'
I1209 15:40:15.567398   13256 b295d63588a.cc:863] argv[6]: '--max_log_size=1024'
I1209 15:40:15.567654   13256 builtin.cc:16] 7edfa70aa11b3ffd6f: /memfile/routing_cache_files
I1209 15:40:15.567666   13256 builtin.cc:16] 7edfa70aa11b3ffd6f: /memfile/tpu_chip_config_memfile_default
I1209 15:40:15.567670   13256 builtin.cc:16] 7edfa70aa11b3ffd6f: /memfile/tpu_chip_config_memfile_inference
I1209 15:40:15.567675   13256 builtin.cc:16] 7edfa70aa11b3ffd6f: /memfile/tpu_chip_parts_memfile
I1209 15:40:15.567717   13256 coredump_hook.cc:666] Remote crash gathering hook installed.
I1209 15:40:15.568024   13256 prodhostname_userspace_monitor_impl.cc:188] Not running under a Borglet, disabling ProdHostname userspace monitoring.
W1209 15:40:15.568059   13256 tf_tpu_flags.cc:51] Configuring 2a886c8 Platform flags with tensorflow flags. The original flag values are ignored.
W1209 15:40:15.568136   13256 tf_tpu_flags.cc:78] --2a886c8_chips_per_host_bounds overridden to: {x = 2, y = 2, z = 1}
W1209 15:40:15.568150   13256 tf_tpu_flags.cc:85] --2a886c8_wrap overridden to: {x = false, y = false, z = false}
W1209 15:40:15.568155   13256 tf_tpu_flags.cc:94] --2a886c8_host_bounds overridden to: {x = 1, y = 1, z = 1}
W1209 15:40:15.568161   13256 tf_tpu_flags.cc:98] --2a886c8_missing_chip_count overridden to: 0
I1209 15:40:15.568209   13256 logger.cc:274] Enabling threaded logging for severity WARNING
I1209 15:40:23.491515   13256 device_util.cc:61] Found 4 6bf72d463e chips.
I1209 15:40:23.491567   13256 tpu_version_flag.cc:50] Using auto-detected TPU version 6bf72d463e
I1209 15:40:23.492314   13256 device_util.cc:61] Found 4 6bf72d463e chips.
I1209 15:40:23.492992   13256 device_util.cc:61] Found 4 6bf72d463e chips.
I1209 15:40:23.493635   13256 device_util.cc:61] Found 4 6bf72d463e chips.
I1209 15:40:23.493641   13256 flags_util.cc:215] Using default chip configuration.
I1209 15:40:23.493926   13256 flags_util.cc:330] Picked unused port 56020 as a555f10594 port.
I1209 15:40:23.494705   13256 device_util.cc:61] Found 4 6bf72d463e chips.
I1209 15:40:23.494713   13256 2a886c8_platform.cc:402] Initializing 2a886c8Platform hardware implementation.
I1209 15:40:23.495307   13256 device_util.cc:61] Found 4 6bf72d463e chips.
I1209 15:40:23.495957   13256 device_util.cc:61] Found 4 6bf72d463e chips.
W1209 15:40:23.499049   13256 device_scanner.cc:210] failures while refreshing ba16c7433 device info from files:
FAILED_PRECONDITION: Failed to read file [type.googleapis.com/util.ErrorSpacePayload='util::PosixErrorSpace::Bad file descriptor']
FAILED_PRECONDITION: Failed to read file [type.googleapis.com/util.ErrorSpacePayload='util::PosixErrorSpace::Bad file descriptor']
FAILED_PRECONDITION: Failed to read file [type.googleapis.com/util.ErrorSpacePayload='util::PosixErrorSpace::Bad file descriptor']