`notebook_launcher` fails with `num_processes>=2`
I am trying to run model training with the accelerate package. When I run the training from a script, everything works fine (both single-GPU and multi-GPU). When using notebook_launcher, the single-GPU setting also works without a problem. However, once I try multi-GPU, I receive the following error:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
/tmp/ipykernel_189978/2375760645.py in <module>
----> 1 notebook_launcher(pytorch_finetuning.train, training_args, num_processes=2)
~/miniconda3/envs/accelerate/lib/python3.8/site-packages/accelerate/notebook_launcher.py in notebook_launcher(function, args, num_processes, use_fp16, use_port)
116 try:
117 print(f"Launching a training on {num_processes} GPUs.")
--> 118 start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
119 finally:
120 # Clean up the environment variables set.
~/miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
155
156 # Loop on join until it returns True or raises an exception.
--> 157 while not context.join():
158 pass
159
~/miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
116 msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
117 msg += original_trace
--> 118 raise Exception(msg)
119
120
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/accelerate/utils.py", line 570, in __call__
self.launcher(*args)
File "...pytorch_finetuning.py", line 1422, in train
accelerate.Accelerator(fp16=use_fp16) if use_accelerate else None
File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/accelerate/accelerator.py", line 117, in __init__
self.state = AcceleratorState(fp16=fp16, cpu=cpu, deepspeed_plugin=deepspeed_plugin, _from_accelerator=True)
File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/accelerate/state.py", line 178, in __init__
torch.distributed.init_process_group(backend="nccl")
File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File ".../miniconda3/envs/accelerate/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: CUDA error: initialization error
At this moment, I’m not really sure whether the issue is on my side, on the accelerate side, or on the torch side. Thanks a lot for any suggestions in advance! 😃
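For context on why only the forked multi-GPU path breaks: the traceback above shows that notebook_launcher calls start_processes(..., start_method="fork"), and a CUDA context created in the parent process does not survive a fork. The stdlib-only sketch below (all names are illustrative; nothing here is from accelerate) demonstrates the underlying mechanism: fork-started children inherit the parent's in-memory state wholesale, and an inherited CUDA context is exactly what then fails with "initialization error".

```python
import multiprocessing as mp

# Stand-in for a CUDA context created at notebook top level,
# i.e. before notebook_launcher forks the worker processes.
PARENT_STATE = "cuda-initialized"

def _worker(q):
    # With start_method="fork", the child process is a memory copy of the
    # parent, so it inherits PARENT_STATE. A real CUDA context inherited
    # this way is unusable in the child and raises an initialization error.
    q.put(PARENT_STATE)

def child_inherits_parent_state():
    ctx = mp.get_context("fork")
    q = ctx.Queue()
    p = ctx.Process(target=_worker, args=(q,))
    p.start()
    inherited = q.get()
    p.join()
    return inherited
```

With the "spawn" start method a child re-imports the module instead of inheriting the parent's memory, so a parent CUDA context is never copied — which is why script-based launchers that spawn fresh interpreters don't hit this.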
Environment info
- accelerate version: 0.5.1
- transformers version: 3.5.0
- Platform: Ubuntu 20.04.3 LTS
- Python version: 3.8.11 (miniconda distribution)
- PyTorch version (GPU?): 1.7.0+cu110
- Using GPU in script?: yes (2x NVIDIA RTX A4000)
- Using distributed or parallel set-up in script?: yes (num_processes=2)
Issue Analytics
- Created: 2 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
Just a quick note. After using the command notebook_launcher(train_loop, num_processes=7, use_fp16=True), I used to get this error. But I fixed it as noted in #56: "CUDA has to be uninitialized until inside the training function or the forking process is not happy". So I removed all calls to torch.cuda, such as device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu"), and it worked!

@sgugger In my case, I can confirm the problem depends on the torch version. Distributed training in notebooks works perfectly with torch==1.8.x+cu111 or torch==1.9.x+cu111, but not with torch==1.7.x+cu110. Unfortunately, I didn’t find any workaround for the 1.7 version and don’t know exactly what the real cause of the problem is.
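The workaround quoted above can be sketched as follows. This is a hypothetical minimal pattern, not code from the issue: every torch / CUDA access — including creating the Accelerator, which is what triggers init_process_group in the traceback — is deferred into the launched function, so the notebook process holds no CUDA context at the moment notebook_launcher forks.

```python
# Hedged sketch of the workaround: nothing torch/CUDA-related runs at
# notebook top level, so the parent process stays CUDA-free before forking.
# `train_loop` and its body are illustrative, not taken from the issue.
def train_loop():
    # Deferred imports: each forked worker initializes torch/CUDA itself.
    import torch
    from accelerate import Accelerator

    accelerator = Accelerator(fp16=True)  # accelerate 0.5.x-era API
    device = accelerator.device           # instead of torch.device("cuda:0")
    # ... build model/optimizer here, then accelerator.prepare(...) ...
```

In the notebook you would then launch it with notebook_launcher(train_loop, num_processes=2, use_fp16=True), making sure no earlier cell touches torch.cuda.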