
`notebook_launcher` fails with `num_processes>=2`


Issue

I'm trying to run model training with the accelerate package. When I run the training from a script, everything works fine (both single-GPU and multi-GPU). When using `notebook_launcher`, the single-GPU setting also works without a problem. However, as soon as I try multi-GPU, I receive the following error:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
/tmp/ipykernel_189978/2375760645.py in <module>
----> 1 notebook_launcher(pytorch_finetuning.train, training_args, num_processes=2)

~/miniconda3/envs/accelerate/lib/python3.8/site-packages/accelerate/notebook_launcher.py in notebook_launcher(function, args, num_processes, use_fp16, use_port)
    116             try:
    117                 print(f"Launching a training on {num_processes} GPUs.")
--> 118                 start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
    119             finally:
    120                 # Clean up the environment variables set.

~/miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    155 
    156     # Loop on join until it returns True or raises an exception.
--> 157     while not context.join():
    158         pass
    159 

~/miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    116         msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
    117         msg += original_trace
--> 118         raise Exception(msg)
    119 
    120 

Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File ."../miniconda3/envs/accelerate/lib/python3.8/site-packages/accelerate/utils.py", line 570, in __call__
    self.launcher(*args)
  File "...pytorch_finetuning.py", line 1422, in train
    accelerate.Accelerator(fp16=use_fp16) if use_accelerate else None
  File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/accelerate/accelerator.py", line 117, in __init__
    self.state = AcceleratorState(fp16=fp16, cpu=cpu, deepspeed_plugin=deepspeed_plugin, _from_accelerator=True)
  File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/accelerate/state.py", line 178, in __init__
    torch.distributed.init_process_group(backend="nccl")
  File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: CUDA error: initialization error

At this moment, I'm not really sure whether the issue is on my side, on accelerate's, or on torch's. Thanks a lot for any suggestions in advance! 😃
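
For reference, here is a minimal sketch of the failing call pattern as a single notebook cell. The module and function names come from the traceback above; the contents of `training_args` are unknown, so it is shown as a placeholder:

```python
from accelerate import notebook_launcher
import pytorch_finetuning  # the user's own training module, as named in the traceback

# Placeholder: whatever tuple of arguments pytorch_finetuning.train() expects.
training_args = ()

# Works with num_processes=1, but fails with "CUDA error: initialization error"
# under num_processes=2 on torch 1.7.0+cu110.
notebook_launcher(pytorch_finetuning.train, training_args, num_processes=2)
```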


Environment info

  • accelerate version: 0.5.1
  • transformers version: 3.5.0
  • Platform: Ubuntu 20.04.3 LTS
  • Python version: 3.8.11 (miniconda distribution)
  • PyTorch version (GPU?): 1.7.0+cu110
  • Using GPU in script?: yes (2x NVIDIA RTX A4000)
  • Using distributed or parallel set-up in script?: yes (num_processes=2)

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
shahhaard47 commented on Dec 11, 2021

Just a quick note. After using the command `notebook_launcher(train_loop, num_processes=7, use_fp16=True)`, I used to get this error:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/nethome/hshah310/miniconda3/envs/cups/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/nethome/hshah310/miniconda3/envs/cups/lib/python3.7/site-packages/accelerate/utils.py", line 571, in __call__
    self.launcher(*args)
  File "/tmp/ipykernel_33040/2138999366.py", line 24, in train_loop
    accelerator = Accelerator()
  File "/nethome/hshah310/miniconda3/envs/cups/lib/python3.7/site-packages/accelerate/accelerator.py", line 117, in __init__
    self.state = AcceleratorState(fp16=fp16, cpu=cpu, deepspeed_plugin=deepspeed_plugin, _from_accelerator=True)
  File "/nethome/hshah310/miniconda3/envs/cups/lib/python3.7/site-packages/accelerate/state.py", line 184, in __init__
    torch.cuda.set_device(self.device)
  File "/nethome/hshah310/miniconda3/envs/cups/lib/python3.7/site-packages/torch/cuda/__init__.py", line 261, in set_device
    torch._C._cuda_setDevice(device)
  File "/nethome/hshah310/miniconda3/envs/cups/lib/python3.7/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

But I fixed it as noted in #56: "CUDA has to be uninitialized until inside the training function or the forking process is not happy". So I removed all calls to torch.cuda, such as `device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")`, and it worked!
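
Put together, the working pattern looks roughly like the sketch below (`train_loop` here is a hypothetical stand-in for the real training code; the key point, per the comment above, is that nothing touches `torch.cuda` in the notebook process before the workers are forked):

```python
from accelerate import Accelerator, notebook_launcher

def train_loop():
    # CUDA is first initialized here, inside each forked worker,
    # when the Accelerator sets up the distributed state.
    accelerator = Accelerator()
    device = accelerator.device  # instead of torch.device("cuda:0" ...)
    # ... build model/optimizer, call accelerator.prepare(...), train ...

# Important: no torch.cuda calls (device queries, .cuda() on tensors, etc.)
# may run in the notebook before this line, or the forked subprocesses will
# fail with "Cannot re-initialize CUDA in forked subprocess".
notebook_launcher(train_loop, num_processes=2, use_fp16=True)
```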

1 reaction
stancld commented on Oct 11, 2021

@sgugger In my case, I can confirm the problem depends on the torch version. Distributed training in notebooks works perfectly with torch==1.8.x+cu111 or torch==1.9.x+cu111, but not with torch==1.7.x+cu110. Unfortunately, I haven't found any workaround for the 1.7 series and don't know exactly what the real cause of the problem is.
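
A quick diagnostic sketch for checking which combination you are on:

```python
import torch

# Neither attribute initializes CUDA, so this is safe to run in the notebook
# before calling notebook_launcher.
print(torch.__version__)   # e.g. "1.7.0+cu110" is the combination reported to fail
print(torch.version.cuda)  # CUDA version the installed wheel was built against
```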
