'libtpu.so already in use' but actually not used
See original GitHub issue.

Sep 2022 Update
Solution: Run
rm -rf /tmp/libtpu_lockfile /tmp/tpu_logs
before running Python.
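If you prefer to do the cleanup from Python itself, here is a minimal sketch, assuming it runs in the same script right before JAX is imported (the two paths come from the command above; everything else is illustrative):

import os
import shutil

# Remove the stale lock file and log directory left over from a previous run.
# /tmp/libtpu_lockfile is a file and /tmp/tpu_logs is a directory, so handle both.
for path in ("/tmp/libtpu_lockfile", "/tmp/tpu_logs"):
    if os.path.isdir(path):
        shutil.rmtree(path, ignore_errors=True)
    elif os.path.exists(path):
        os.remove(path)

import jax  # import only after the cleanup so libtpu can acquire the TPU
print(jax.devices())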
Original Post
We can test if TPU is being used by this command:
python -c 'import jax; print(jax.devices())'
In theory, if the TPU is not in use, it will print TpuDevice; otherwise, it will print CpuDevice and show a warning:
I0000 00:00:1649423660.053391 1924758 f236.cc:165] libtpu.so already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. Not attempting to load libtpu.so in this process.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
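If you want this check to fail loudly instead of silently falling back to CPU, here is a small sketch (the platform attribute is part of JAX's Device API; the error message is my own):

import jax

devices = jax.devices()
print(devices)

# When libtpu cannot be loaded, JAX falls back to CpuDevice with only a warning,
# so turn that fallback into an explicit error.
if not any(d.platform == "tpu" for d in devices):
    raise RuntimeError("No TPU detected; JAX fell back to CPU.")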
However, the command sometimes reports that the TPU is already in use by another process, even though I am sure it is not. Moreover, sudo lsof -w /dev/accel0 shows that no process is using the TPU.
To rule out the possibility that another process using the TPU had just exited, I reran the command several times, and the results were the same.
This bug even happens when I create multiple users on the TPU VM: I log in as one user and it reports that the TPU is in use, but then I immediately log in as another user and it works fine.
I want to help debug this issue, but I don't know where to start.
Hey @ayaka14732, sorry for the delay! I was hiking in Nepal 🏔️
Just to make sure I understand, is the issue that /tmp/libtpu_lockfile sometimes exists even when no process is using the TPU? I'm not sure what "works" and "not works" means in your comment above.

@skye I found at least one cause of the problem:
I opened tensorflow/core/tpu/tpu_initializer_helper.cc, which shows that the program checks a lock file, /tmp/libtpu_lockfile. So I removed the lock file.
Then it complained about /tmp/tpu_logs, so I removed that directory as well. After that, the command works.
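For reference, here is a quick sketch for checking whether that lock file is still around and which user owns it (only the /tmp/libtpu_lockfile path comes from the thread; the rest is illustrative and Linux-only):

import os
import pwd

lockfile = "/tmp/libtpu_lockfile"
if os.path.exists(lockfile):
    # Report the owner, since the lock file may have been left behind by another user.
    owner = pwd.getpwuid(os.stat(lockfile).st_uid).pw_name
    print(f"{lockfile} exists and is owned by {owner}")
else:
    print(f"{lockfile} does not exist")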