Compilation hangs indefinitely on GPU

I am encountering an issue where compilation on GPU hangs forever in a semi-deterministic way (happens every time, but at slightly different places). All functions have been compiled successfully before (but with different shapes).

This happens in the middle of a large model codebase, and unfortunately I haven’t been able to come up with a reproducer. After 2 minutes I get the “slow compile” warning, then all I can do is send SIGKILL.

I have dumped the HLO but it looks inconspicuous to me:

https://gist.github.com/dionhaefner/e5680e131975b6bf566c1e1cbc554476

The only lead I have is that right before it hangs, I do something like this:

# <do computations on GPU with JAX>

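import numpy as onp
import scipy.sparse.linalg
import jax.numpy as jnp

# copy the JAX device arrays to host NumPy arrays for SciPy
rhs = onp.asarray(rhs)
x0 = onp.asarray(x0)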

linear_solution, info = scipy.sparse.linalg.bicgstab(
    _matrix,
    rhs,
    x0=x0,
    atol=0,
    tol=settings.congr_epsilon,
    maxiter=settings.congr_max_iterations,
    **self._extra_args,
)

return jnp.asarray(linear_solution)

# a couple of lines later everything hangs

If I comment out the BiCG solver everything works.

This happens both with JAX built from source and with the current wheels. Downgrading jaxlib did not help either. It does work on jaxlib 0.1.64, albeit poorly (a factor of 10 slower for some reason).

If you have any pointers on how to debug this, I would be grateful.
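One generic way to see where a Python process is stuck, without attaching a debugger, is the standard library's faulthandler module. The sketch below is not from the original issue: the helper name run_with_hang_watchdog and its parameters are hypothetical, and it only dumps Python-level stack traces. It would show which Python call triggered the hanging compilation, not what happens inside the native compiler.

```python
import faulthandler
import sys


def run_with_hang_watchdog(fn, timeout_s=120.0, file=None):
    """Call fn() and dump all Python thread stacks if it runs too long.

    Hypothetical helper for illustration. `file` must be a real file
    object with a file descriptor; it defaults to sys.stderr.
    """
    out = file if file is not None else sys.stderr
    # Dump the traceback of every thread after timeout_s seconds, and keep
    # repeating at that interval for as long as fn() is still running.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)
    try:
        return fn()
    finally:
        # Always disarm the watchdog once fn() returns or raises.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the call that hangs, for example run_with_hang_watchdog(lambda: model_step(state)) with model_step standing in for whatever function triggers the compilation, should print the stacks of all threads every two minutes until the process is killed.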

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

3 reactions
dionhaefner commented, Jun 15, 2021

FWIW, this does not occur when I do

$ export OMP_NUM_THREADS=1

Could this be SciPy’s internal OpenMP parallelization clashing with JAX’s thread parallelism?
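The same workaround can be applied from inside the Python program rather than the shell. This is a minimal sketch, assuming the variable is set before NumPy, SciPy, or JAX are imported, since OpenMP runtimes typically read it once at start-up. The helper name limit_native_threads is hypothetical, and the OPENBLAS_NUM_THREADS and MKL_NUM_THREADS variables are an addition not mentioned in the issue; they cover BLAS backends that keep their own thread pools.

```python
import os


def limit_native_threads(num_threads=1):
    """Cap the thread pools of native numerical libraries.

    Hypothetical helper. Only OMP_NUM_THREADS comes from the issue;
    the other two variables are assumptions for OpenBLAS and MKL builds.
    """
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(num_threads)


# Call this at the very top of the entry script, before importing
# numpy, scipy, or jax, so the libraries see the limit when they load.
limit_native_threads(1)
```

Whether this avoids the hang in a given setup is not guaranteed; it only reproduces the environment that the comment above reports as working.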

0 reactions
dionhaefner commented, Aug 15, 2022

Seems fixed with recent JAX, thanks!
