`notebook_launcher` fails with `num_processes>=2`
I am trying to run model training with the accelerate package. When I run the training from a script, everything works fine (both single-GPU and multi-GPU). When using notebook_launcher, the single-GPU setting also works without a problem. However, once I try multi-GPU, I receive the following error:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
/tmp/ipykernel_189978/2375760645.py in <module>
----> 1 notebook_launcher(pytorch_finetuning.train, training_args, num_processes=2)
~/miniconda3/envs/accelerate/lib/python3.8/site-packages/accelerate/notebook_launcher.py in notebook_launcher(function, args, num_processes, use_fp16, use_port)
116 try:
117 print(f"Launching a training on {num_processes} GPUs.")
--> 118 start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
119 finally:
120 # Clean up the environment variables set.
~/miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
155
156 # Loop on join until it returns True or raises an exception.
--> 157 while not context.join():
158 pass
159
~/miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
116 msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
117 msg += original_trace
--> 118 raise Exception(msg)
119
120
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/accelerate/utils.py", line 570, in __call__
self.launcher(*args)
File "...pytorch_finetuning.py", line 1422, in train
accelerate.Accelerator(fp16=use_fp16) if use_accelerate else None
File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/accelerate/accelerator.py", line 117, in __init__
self.state = AcceleratorState(fp16=fp16, cpu=cpu, deepspeed_plugin=deepspeed_plugin, _from_accelerator=True)
File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/accelerate/state.py", line 178, in __init__
torch.distributed.init_process_group(backend="nccl")
File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: CUDA error: initialization error
At this moment, I’m not really sure whether the issue is on my side, on the accelerate side, or on the torch side. Thanks a lot for any suggestions in advance! 😃
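For context on why only the forked multi-GPU path breaks: the traceback above shows that notebook_launcher calls start_processes(..., start_method="fork"), and a CUDA context created in the parent process does not survive a fork. The stdlib-only sketch below (all names are illustrative; nothing here is from accelerate) demonstrates the underlying mechanism: fork-started children inherit the parent's in-memory state wholesale, and an inherited CUDA context is exactly what then fails with "initialization error".

```python
import multiprocessing as mp

# Stand-in for a CUDA context created at notebook top level,
# i.e. before notebook_launcher forks the worker processes.
PARENT_STATE = "cuda-initialized"

def _worker(q):
    # With start_method="fork", the child process is a memory copy of the
    # parent, so it inherits PARENT_STATE. A real CUDA context inherited
    # this way is unusable in the child and raises an initialization error.
    q.put(PARENT_STATE)

def child_inherits_parent_state():
    ctx = mp.get_context("fork")
    q = ctx.Queue()
    p = ctx.Process(target=_worker, args=(q,))
    p.start()
    inherited = q.get()
    p.join()
    return inherited
```

With the "spawn" start method a child re-imports the module instead of inheriting the parent's memory, so a parent CUDA context is never copied — which is why script-based launchers that spawn fresh interpreters don't hit this.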
Environment info
- accelerate version: 0.5.1
- transformers version: 3.5.0
- Platform: Ubuntu 20.04.3 LTS
- Python version: 3.8.11 (miniconda distribution)
- PyTorch version (GPU?): 1.7.0+cu110
- Using GPU in script?: yes (2x NVIDIA RTX A4000)
- Using distributed or parallel set-up in script?: yes (num_processes=2)
Issue Analytics
- Created: 2 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
Just a quick note. After using the command notebook_launcher(train_loop, num_processes=7, use_fp16=True), I used to get this error. But I fixed it as noted in #56: "CUDA has to be uninitialized until inside the training function or the forking process is not happy". So I removed all calls to torch.cuda, such as device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu"), and it worked!

@sgugger In my case, I can confirm the problem depends on the torch version. Distributed training in notebooks works perfectly with torch==1.8.x+cu111 or torch==1.9.x+cu111, but not with torch==1.7.x+cu110. Unfortunately, I didn’t find any workaround for the 1.7 version and don’t know exactly what the real cause of the problem is.
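The workaround quoted above can be sketched as follows. This is a hypothetical minimal pattern, not code from the issue: every torch / CUDA access — including creating the Accelerator, which is what triggers init_process_group in the traceback — is deferred into the launched function, so the notebook process holds no CUDA context at the moment notebook_launcher forks.

```python
# Hedged sketch of the workaround: nothing torch/CUDA-related runs at
# notebook top level, so the parent process stays CUDA-free before forking.
# `train_loop` and its body are illustrative, not taken from the issue.
def train_loop():
    # Deferred imports: each forked worker initializes torch/CUDA itself.
    import torch
    from accelerate import Accelerator

    accelerator = Accelerator(fp16=True)  # accelerate 0.5.x-era API
    device = accelerator.device           # instead of torch.device("cuda:0")
    # ... build model/optimizer here, then accelerator.prepare(...) ...
```

In the notebook you would then launch it with notebook_launcher(train_loop, num_processes=2, use_fp16=True), making sure no earlier cell touches torch.cuda.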