RuntimeError: INTERNAL: Core halted unexpectedly: No error message available as no compiler metadata was provided.

This script used to run normally on a Cloud TPU v2-8 VM, but it now fails with an error:

import os
os.environ['XLA_PYTHON_CLIENT_ALLOCATOR'] = 'platform'

import jax
import subprocess
np = jax.numpy

devices = jax.devices()

def show_mem(result: np.ndarray) -> str:
    result.block_until_ready()
    jax.profiler.save_device_memory_profile('/tmp/memory.prof')
    return subprocess.run(['go', 'tool', 'pprof', '-tags', '/tmp/memory.prof'], stdout=subprocess.PIPE, stderr=subprocess.DEVNULL).stdout.decode('utf-8')

def largest_v2() -> np.ndarray:
    return np.zeros((1024, 1024, 957, 2), dtype=np.float32)

# print(show_mem(largest_v2()))

print(show_mem(jax.jit(largest_v2, device=devices[1])()))
print(show_mem(jax.jit(largest_v2, device=devices[2])()))

Error message:

$ python test_memory.py
 device: Total 7.5GB
         7.5GB (  100%): TPU_1(process=0,(0,0,0,1))

 kind: Total 7.5GB
         7.5GB (  100%): buffer
       -1.0B (1.2e-08%): executable


2022-02-19 21:29:36.266338: W external/org_tensorflow/tensorflow/stream_executor/stream.cc:275] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
Traceback (most recent call last):
  File "test_memory.py", line 22, in <module>
    print(show_mem(jax.jit(largest_v2, device=devices[2])()))
  File "test_memory.py", line 12, in show_mem
    result.block_until_ready()
RuntimeError: INTERNAL: Core halted unexpectedly: No error message available as no compiler metadata was provided.
2022-02-19 21:29:36.404625: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/local_device_state.cc:74] Error when closing device: INTERNAL: Core halted unexpectedly: No error message available as no compiler metadata was provided.
2022-02-19 21:29:36.404907: W external/org_tensorflow/tensorflow/stream_executor/stream.cc:275] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
2022-02-19 21:29:36.405494: W external/org_tensorflow/tensorflow/stream_executor/stream.cc:275] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
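
As a sanity check on the profile above, the 7.5GB buffer reported for TPU_1 matches the array that largest_v2() requests. The sketch below only redoes that arithmetic. The 8 GiB per-core HBM figure is an assumption about TPU v2 hardware and is not stated anywhere in the issue.

```python
import math

# Shape and element size taken from largest_v2() in the script above
# (np.zeros((1024, 1024, 957, 2), dtype=np.float32)).
shape = (1024, 1024, 957, 2)
bytes_per_float32 = 4

total_bytes = math.prod(shape) * bytes_per_float32
total_gib = total_bytes / 2**30

# Assumption (not from the issue): each TPU v2 core has 8 GiB of HBM.
assumed_hbm_gib = 8

print(f"array size: {total_bytes} bytes = {total_gib:.4f} GiB")
print(f"headroom under the assumed {assumed_hbm_gib} GiB: "
      f"{assumed_hbm_gib - total_gib:.4f} GiB")
```

If that assumption holds, a single array of this shape leaves only about half a GiB free on the core, so the test allocates close to the largest buffer a v2 core can hold.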

Library versions:

$ pip list | grep jax
jax                      0.3.1
jaxlib                   0.3.0
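
The replies in this thread also depend on the libtpu-nightly version, so it can help to report all three packages at once. A minimal sketch using only the standard library follows; the helper name installed_version is hypothetical, and the package names are the ones shown in the pip list output in this thread.

```python
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or a marker if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


# Package names as they appear in the `pip list` output in this thread.
for name in ("jax", "jaxlib", "libtpu-nightly"):
    print(f"{name:<16}{installed_version(name)}")
```

This reads package metadata without importing jax, so it should still work in an environment where importing jax itself fails.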

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

skye commented, Mar 4, 2022 (1 reaction)

Thanks for the speedy replies!

For the libtpu-nightly==0.1.dev20220218 SIGABRT failure, please feel free to report that kind of thing here! In this case, we’re already aware of the issue and should have a fixed libtpu-nightly out soon (apologies for suggesting you try it, I forgot about this issue).

Thanks also for isolating where the Core halted unexpectedly error began. This will help with debugging.

young-geng commented, Mar 2, 2022 (1 reaction)

I’m using different code but encountered the same error message. Here are my JAX and libtpu versions:

$ pip list | grep libtpu
libtpu-nightly                    0.1.dev20220128

$ pip list | grep jax
jax                               0.3.1
jaxlib                            0.3.0

I’ve attached my tpu_driver.INFO in this gist.

