'libtpu.so already in use' but actually not used


Sep 2022 Update

Solution: Run

rm -rf /tmp/libtpu_lockfile /tmp/tpu_logs

before running Python.
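The same cleanup can be done from Python before JAX is imported. The following is a minimal sketch, not part of the original fix: the helper name `clear_stale_tpu_state` is hypothetical, and the two default paths are the ones named in the command above. It assumes no other process is legitimately using the TPU, since removing the lock file while one is would defeat the check.

```python
import shutil
from pathlib import Path


def clear_stale_tpu_state(lockfile="/tmp/libtpu_lockfile", log_dir="/tmp/tpu_logs"):
    """Remove the lock file and log directory if present.

    Hypothetical helper: returns the list of paths that were actually removed.
    Only call this when you are sure no process is using the TPU.
    """
    removed = []

    lock = Path(lockfile)
    if lock.is_file() or lock.is_symlink():
        lock.unlink()
        removed.append(str(lock))

    logs = Path(log_dir)
    if logs.is_dir():
        shutil.rmtree(logs)
        removed.append(str(logs))

    return removed
```

Calling `clear_stale_tpu_state()` first and running `import jax` only afterwards mirrors the order in the shell command. If the files belong to another user, the call raises `PermissionError`, which matches the `sudo` used later in this thread.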


Original Post

We can test whether the TPU is already in use with this command:

python -c 'import jax; print(jax.devices())'

In theory, if the TPU is not in use, it will print TpuDevice; otherwise, it will print CpuDevice, and a warning will be shown:

I0000 00:00:1649423660.053391 1924758 f236.cc:165] libtpu.so already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. Not attempting to load libtpu.so in this process.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
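Instead of reading the printed list by eye, a script can detect the CPU fallback directly. This is a rough sketch with a hypothetical helper name, `fell_back_to_cpu`; it relies only on the `platform` attribute that JAX device objects expose (`"cpu"`, `"gpu"`, or `"tpu"`).

```python
def fell_back_to_cpu(devices):
    """Return True if none of the given devices reports the "tpu" platform."""
    return not any(
        getattr(device, "platform", "").lower() == "tpu" for device in devices
    )


# Intended usage on a TPU VM (not executed here):
#   import jax
#   if fell_back_to_cpu(jax.devices()):
#       raise SystemExit("TPU not available; JAX fell back to CPU")
```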

However, sometimes the command reports that the TPU is in use even though I am sure it is not. In addition, sudo lsof -w /dev/accel0 shows that no process is using the TPU.

To rule out the possibility that another process using the TPU had just exited, I reran the command several times, and the results were the same.
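The lsof check can also be approximated from Python by scanning the file descriptors listed under /proc. This is only a sketch under assumptions: the helper name `pids_holding` is hypothetical, it works only on Linux, and like lsof it needs root to see descriptors that belong to other users' processes.

```python
import os


def pids_holding(device="/dev/accel0", proc_root="/proc"):
    """Return the sorted PIDs that have `device` open, based on /proc/<pid>/fd."""
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were scanning
            if target == device:
                holders.append(int(entry))
                break
    return sorted(holders)
```

An empty result while the "already in use" message still appears is consistent with the situation described here: nothing holds the device open, yet JAX refuses to load libtpu.so.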

This bug happens even when I create multiple users on the TPU VM. I log in as one user and it reports that the TPU is in use, but when I immediately log in as another user, it works fine.

I want to help debug this issue, but I don’t know where to start.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

skye commented, May 3, 2022 (1 reaction)

Hey @ayaka14732, sorry for the delay! I was hiking in Nepal 🏔️

Just to make sure I understand, is the issue that /tmp/libtpu_lockfile sometimes exists even when no process is using the TPU? I’m not sure what “works” and “not works” means in your comment above.

ayaka14732 commented, Apr 9, 2022 (1 reaction)

@skye I found at least one cause of the problem:

$ export TF_CPP_MIN_LOG_LEVEL=0
$ python3 -c 'import jax; print(jax.devices())'
2022-04-09 23:15:46.156187: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.156229: I external/org_tensorflow/tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
WARNING: Logging before InitGoogle() is written to STDERR
I0409 23:15:46.181567 1339488 tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.195009: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.198059: I external/org_tensorflow/tensorflow/core/tpu/tpu_executor_dlsym_initializer.cc:68] Libtpu path is: libtpu.so
2022-04-09 23:15:46.198096: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.833145: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:171] XLA service 0x31c7860 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2022-04-09 23:15:46.833194: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:179]   StreamExecutor device (0): Interpreter, <undefined>
2022-04-09 23:15:46.850903: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/tfrt_cpu_pjrt_client.cc:163] TfrtCpuClient created.
2022-04-09 23:15:46.851741: I external/org_tensorflow/tensorflow/stream_executor/tpu/tpu_platform_interface.cc:74] No TPU platform found.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

So I opened tensorflow/core/tpu/tpu_initializer_helper.cc, which shows that the program checks for a lock file, /tmp/libtpu_lockfile.

So I ran:

$ sudo rm -f /tmp/libtpu_lockfile
$ python3 -c 'import jax; print(jax.devices())'

Then it complained about /tmp/tpu_logs, so I removed that directory as well.

After that, the command works:

$ python3 -c 'import jax; print(jax.devices())'
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(0,0,0), core_on_chip=1), TpuDevice(id=2, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,0,0), core_on_chip=1), TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=5, process_index=0, coords=(0,1,0), core_on_chip=1), TpuDevice(id=6, process_index=0, coords=(1,1,0), core_on_chip=0), TpuDevice(id=7, process_index=0, coords=(1,1,0), core_on_chip=1)]