
Initializing the default process group twice when integrating with DeepSpeed

See original GitHub issue

System Info

- `Accelerate` version: 0.12.0.dev0
- Platform: Linux-4.15.0-151-generic-x86_64-with-glibc2.27
- Python version: 3.9.12
- Numpy version: 1.22.3
- PyTorch version (GPU?): 1.12.0 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3}
        - fsdp_config: {}

If I run the following code:

```python
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
```

I get:

RuntimeError: trying to initialize the default process group twice!
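This RuntimeError is raised by torch.distributed when `init_process_group` is called a second time for the default group, which typically means two code paths (for example, Accelerate's own setup and DeepSpeed's) both try to initialize it. As a minimal sketch of the kind of guard that avoids this, assuming the standard `torch.distributed` API and an NCCL backend:

```python
import torch.distributed as dist

# Only create the default process group if nothing else
# (e.g. DeepSpeed's launcher) has initialized it already.
if not dist.is_initialized():
    dist.init_process_group(backend="nccl")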

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

```python
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
```
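For completeness, here is a self-contained version of the reproduction; the model, optimizer, and dataloaders are hypothetical placeholders, since the original report does not include them. Launched with `accelerate launch` under the DeepSpeed config shown above:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder model, optimizer, and data (not from the original report);
# any real setup exercises the same code path.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
train_dataloader = DataLoader(dataset, batch_size=8)
eval_dataloader = DataLoader(dataset, batch_size=8)

# Under the affected Accelerate/DeepSpeed versions, the RuntimeError
# surfaces here or in prepare().
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
```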

Expected behavior

No error when integrating with DeepSpeed; `accelerator.prepare` should succeed.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5

Top GitHub Comments

1 reaction
pacman100 commented, Jul 25, 2022

The above merged PR should solve this issue, and folks can now use the latest DeepSpeed version without any problems.
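The PR itself isn't quoted here, but the shape of the fix is presumably a guard like the following sketch (not the actual diff): skip distributed setup when a default process group already exists.

```python
import torch.distributed as dist
import deepspeed

# Sketch only, not the merged PR: let DeepSpeed set up distributed
# state only when no default process group has been created yet.
if not dist.is_initialized():
    deepspeed.init_distributed(dist_backend="nccl")
```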

0 reactions
wookjeHan commented, Jul 20, 2022

Thanks! It works for me!


Top Results From Across the Web

Distributed training in PyTorch and init_process_group - Ray
The error I get is: RuntimeError: trying to initialize the default process group twice! Does Tune have the ability to allow us to...
Train 1 trillion+ parameter models - PyTorch Lightning
Lightning integration of optimizer sharded training provided by FairScale. The technique can be found within DeepSpeed ZeRO and ZeRO-2, ...
Efficient Training on a Single GPU - Hugging Face
In this section we have a look at a few tricks to reduce the memory footprint and speed up training for large models...
Per device APIs — Gaudi Documentation
The identifier of the target AIP. Return Value: Product name. Raises: HLMLError_Uninitialized if the library has not been successfully initialized.
PyTorch Lightning 1.5 Released - Exxact Corporation
DeepSpeed is a deep learning training optimization library, providing the ... __init__ if the PyTorch version supports ShardedTensor (#8944) ...
