RuntimeError: trying to initialize the default process group twice when integrating with DeepSpeed
System Info
- `Accelerate` version: 0.12.0.dev0
- Platform: Linux-4.15.0-151-generic-x86_64-with-glibc2.27
- Python version: 3.9.12
- Numpy version: 1.22.3
- PyTorch version (GPU?): 1.12.0 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: no
- use_cpu: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3}
- fsdp_config: {}
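For reference, the same DeepSpeed settings can also be passed to `Accelerator` programmatically instead of through the saved config file. This is only a minimal sketch, assuming the `DeepSpeedPlugin` fields shown below exist under this Accelerate version; check the signature of your installed release:

```python
from accelerate import Accelerator, DeepSpeedPlugin

# Mirror the deepspeed_config above (field names assumed for this Accelerate version).
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,
    gradient_accumulation_steps=4,
    offload_optimizer_device="cpu",
    offload_param_device="none",
    zero3_init_flag=True,
    zero3_save_16bit_model=False,
)

accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
```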
If I run the following code:

```python
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
```

I get:

RuntimeError: trying to initialize the default process group twice!
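For context, this error is raised when `torch.distributed.init_process_group` is called while the default process group already exists (for example, once by the launcher or DeepSpeed and once more inside the training code). A small sanity check using the standard `torch.distributed` API, as a sketch:

```python
import torch.distributed as dist

# Report whether the default process group is already up before anything tries to create it again.
if dist.is_available() and dist.is_initialized():
    print(f"Default process group already initialized: rank {dist.get_rank()} / world size {dist.get_world_size()}")
else:
    print("Default process group not initialized yet")
```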
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the `examples/` folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
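A self-contained script along these lines reproduces the call pattern; the toy model, optimizer, and dataloaders below are hypothetical stand-ins, not taken from the original report:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Hypothetical toy setup standing in for the reporter's model and data.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
train_dataloader = DataLoader(dataset, batch_size=8)
eval_dataloader = DataLoader(dataset, batch_size=8)

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
```

When launched with `accelerate launch` under the DeepSpeed config above, this is the point where the reported error appears.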
Expected behavior
I expect no error when integrating with DeepSpeed.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
The above merged PR should solve this issue, and folks can now use the latest DeepSpeed version without any problem.
Thanks! It works for me!
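Since the fix depends on running recent releases, a quick way to confirm what is actually installed before retrying (the exact versions that contain the fix are not stated in the thread):

```python
import accelerate
import deepspeed
import torch

# Print the installed versions; the fix referenced above needs a recent Accelerate + DeepSpeed pair.
print("accelerate:", accelerate.__version__)
print("deepspeed:", deepspeed.__version__)
print("torch:", torch.__version__)
```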