
[BUG] version `0.6.7` is throwing `RuntimeError: trying to initialize the default process group twice!`


Describe the bug
Using the Accelerate integration of DeepSpeed with the config below.

- `Accelerate` version: 0.12.0.dev0
- Platform: Linux-5.4.0-121-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.23.0
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
        - fsdp_config: {}
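
For orientation only: the same settings can also be passed programmatically. The sketch below is an assumption about the DeepSpeedPlugin API around Accelerate 0.12 (the field names simply mirror the config keys above); it is illustrative, not part of the original report.

    # Hypothetical sketch, assuming the Accelerate ~0.12 DeepSpeedPlugin API.
    from accelerate import Accelerator, DeepSpeedPlugin

    deepspeed_plugin = DeepSpeedPlugin(
        zero_stage=3,
        gradient_accumulation_steps=4,
        offload_optimizer_device="cpu",
        offload_param_device="none",
        zero3_init_flag=True,           # makes from_pretrained use deepspeed.zero.Init
        zero3_save_16bit_model=True,
    )
    accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)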

To Reproduce
Steps to reproduce the behavior:

  1. Run the accelerate config command to set the above DeepSpeed config.
  2. Run accelerate launch complete_nlp_example.py (the official example script).
  3. The error below is thrown when using version 0.6.7, whereas version 0.6.5 works fine:
File "complete_nlp_example.py", line 128, in training_function
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
  File "/home/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 446, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/home/sourab/transformers/src/transformers/modeling_utils.py", line 2065, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 655, in __init__
    init_distributed()
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 427, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method)
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 35, in __init__
    self.init_process_group(backend, timeout, init_method)
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 38, in init_process_group
    return torch.distributed.init_process_group(backend,
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 563, in init_process_group
    raise RuntimeError("trying to initialize the default process group " "twice!")
RuntimeError: trying to initialize the default process group twice!
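
This exception means torch.distributed.init_process_group was reached a second time for the default group: the group had already been created earlier in the run (presumably by Accelerate's own setup) before deepspeed.zero.Init triggered DeepSpeed's init_distributed. As a minimal illustration of the usual guard against this (not the actual DeepSpeed or Accelerate code):

    # Sketch only: initialize the default process group at most once.
    import torch.distributed as dist

    def init_process_group_once(backend="nccl", **kwargs):
        # init_process_group raises "trying to initialize the default process
        # group twice!" if the default group already exists, so check first.
        if dist.is_available() and not dist.is_initialized():
            dist.init_process_group(backend=backend, **kwargs)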

Expected behavior
No error when using the latest version, 0.6.7.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/sourab/dev/lib/python3.8/site-packages/torch']
torch version .................... 1.12.0+cu102
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed install path ........... ['/home/sourab/dev/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 10.2

System info (please complete the following information):

  • OS: Ubuntu 20.04.3 LTS (Focal Fossa)
  • GPU count and types: 1 machine with 2x NVIDIA TITAN RTX
  • Python version: Python 3.8.10
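
For cross-checking the environment above, the versions can be printed from inside the failing run using nothing more than the standard __version__ attributes:

    # Print the package versions relevant to this report.
    import torch
    import deepspeed
    import accelerate

    print("torch:", torch.__version__)            # 1.12.0+cu102 in this report
    print("deepspeed:", deepspeed.__version__)    # 0.6.7 fails, 0.6.5 works
    print("accelerate:", accelerate.__version__)  # 0.12.0.dev0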

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
The Accelerate launcher, which just triggers the deepspeed launcher.

Additional context
Original issue raised in the Accelerate repo: https://github.com/huggingface/accelerate/issues/536

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

2 reactions
jeffra commented, Jul 22, 2022

Awesome, thank you @pacman100! Accelerate is an important integration for us; this will help us avoid issues like this in the future 😃

1 reaction
Quentin-Anthony commented, Jul 21, 2022

@pacman100, @awan-10, and I are able to run this example if we perform the following:

  1. Use the DeepSpeed branch in https://github.com/microsoft/DeepSpeed/pull/2121
  2. Check out an accelerate commit older than https://github.com/huggingface/accelerate/commit/164943c7d7bf5a84b883d9c0ad89796780ef6292 (since this commit leads to the error: AttributeError: 'Accelerator' object has no attribute 'gather_for_metrics')
  3. Run with TF_FORCE_GPU_ALLOW_GROWTH=true accelerate launch complete_nlp_example.py

Please let me know if this works for you!


Top Results From Across the Web

Distributed training in PyTorch and init_process_group - Ray
The error I get is: RuntimeError: trying to initialize the default process group twice! Does Tune have the ability to allow us to...

Bug listing with status RESOLVED with resolution FIXED as at ...
Bug :2 - "How do I attach an ebuild. ... Bug:78 - "Slib Is listed as a dependency twice in gnucash ebuild file"...

Update the process group in torch.distributed created using ...
However, reinitializing issues this error. RuntimeError: trying to initialize the default process group twice.

Change Log of simuPOP - SourceForge
This page lists Change logs of all official simuPOP releases. If you would like to see what has been changed since the last...

Source code for torch.distributed.distributed_c10d - AI研习社
def _get_default_group(): """ Getting the default process group created by ... raise RuntimeError("trying to initialize the default process group " "twice!
