Failed to initialize CUDA backend on multi-process distributed environments
Description
The issue happens when trying to initialize a multi-process, multi-GPU environment with Slurm (although I suspect the problem is not specific to Slurm).
Take the following simple script:
import jax
import logging

logging.getLogger().setLevel(logging.DEBUG)
jax.distributed.initialize()

if jax.process_index() == 0:
    print(jax.devices())

print(jax.device_count())        # total number of accelerator devices in the cluster
print(jax.local_device_count())  # number of accelerator devices attached to this host
and execute it with:
srun --gres=gpu:2 --ntasks=2 --nodes=1 python main.py
It returns:
INFO:absl:JAX distributed initialized with visible devices: 0
INFO:absl:JAX distributed initialized with visible devices: 1
INFO:absl:Starting JAX distributed service on ainode17:4192
INFO:absl:Connecting to JAX distributed service on ainode17:4192
INFO:absl:Connecting to JAX distributed service on ainode17:4192
DEBUG:absl:Initializing backend 'interpreter'
DEBUG:absl:Initializing backend 'interpreter'
DEBUG:absl:Backend 'interpreter' initialized
DEBUG:absl:Initializing backend 'cpu'
DEBUG:absl:Backend 'cpu' initialized
DEBUG:absl:Initializing backend 'tpu_driver'
INFO:absl:Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
DEBUG:absl:Initializing backend 'cuda'
DEBUG:absl:Backend 'interpreter' initialized
DEBUG:absl:Initializing backend 'cpu'
DEBUG:absl:Backend 'cpu' initialized
DEBUG:absl:Initializing backend 'tpu_driver'
INFO:absl:Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
DEBUG:absl:Initializing backend 'cuda'
2022-09-27 19:23:48.425044: E external/org_tensorflow/tensorflow/compiler/xla/status_macros.cc:57] INTERNAL: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/pjrt/gpu_device.cc:345) local_device->device_ordinal() == local_topology.devices_size()
*** Begin stack trace ***
PyCFunction_Call
_PyObject_MakeTpCall
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyObject_FastCallDict
_PyObject_MakeTpCall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
PyObject_Call
_PyObject_MakeTpCall
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
PyEval_EvalCode
PyRun_SimpleFileExFlags
Py_RunMain
Py_BytesMain
__libc_start_main
_start
*** End stack trace ***
INFO:absl:Unable to initialize backend 'cuda': INTERNAL: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/pjrt/gpu_device.cc:345) local_device->device_ordinal() == local_topology.devices_size()
DEBUG:absl:Initializing backend 'rocm'
INFO:absl:Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA Interpreter Host
DEBUG:absl:Initializing backend 'tpu'
INFO:absl:Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Recently (in 0.3.18) there was an update to the cluster interface (Slurm and TPU pods), but the problem does not seem to be caused by that: manually setting coordinator_address, num_processes and process_id in distributed.initialize(...) has the same effect.
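For reference, here is a minimal sketch of the manual form mentioned above. It assumes the standard Slurm environment variables SLURM_NTASKS and SLURM_PROCID; the coordinator host and port are illustrative placeholders, not values taken from this report.

import os
import jax

# Manual initialization: derive the process layout from Slurm's environment.
# Replace the coordinator address with the hostname of the first node in the
# allocation and a free port.
jax.distributed.initialize(
    coordinator_address="ainode17:4192",            # placeholder host:port
    num_processes=int(os.environ["SLURM_NTASKS"]),  # total number of tasks
    process_id=int(os.environ["SLURM_PROCID"]),     # rank of this task
)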
Am I doing something wrong?
What jax/jaxlib version are you using?
jax==0.3.18, jaxlib==0.3.15+cuda11.cudnn82
Which accelerator(s) are you using?
GPUs
Additional system info
No response
NVIDIA GPU info
No response
This should be fixed with jax and jaxlib 0.3.20, which we just released. Please try it out!

Thanks for the update. I was compiling the new version from source, but I'll just try the precompiled build tomorrow.
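For anyone landing here later, the usual way to pick up the prebuilt CUDA wheels at the time was via the jax-releases wheel index; the command below follows the standard JAX GPU install instructions and is not something posted in this thread, so adjust the extras name and version pin to your CUDA/cuDNN setup.

# Upgrade jax and the CUDA-enabled jaxlib from the prebuilt wheel index.
pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html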