
Multi-node deepspeed calling the runner instead of the launcher

See original GitHub issue

System Info

- `Accelerate` version: 0.10.0
- Platform: Linux-5.10.112-108.499.amzn2.x86_64-x86_64-with-glibc2.2.5
- Python version: 3.7.10
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.12.0+cu113 (True)
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: DEEPSPEED
	- mixed_precision: no
	- use_cpu: False
	- num_processes: 8
	- machine_rank: 0
	- num_machines: 1
	- main_process_ip: None
	- main_process_port: None
	- main_training_function: main
	- deepspeed_config: {'deepspeed_config_file': '/path/to/deepspeed_config.json', 'zero3_init_flag': False}
	- fsdp_config: {}

deepspeed==0.6.5

{
    "train_batch_size": 128,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 0.00001
    },
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": true,
        "contiguous_gradients": true,
        "overlap_comm": true
    }
}
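As a quick sanity check on this config: DeepSpeed requires that `train_batch_size` factor as micro-batch per GPU × gradient accumulation steps × world size. The sketch below assumes a world size of 8 processes, taken from the `num_processes` value above; that assumption is not stated explicitly in the report.

```python
# DeepSpeed constraint: train_batch_size must equal
#   micro_batch_per_gpu * gradient_accumulation_steps * world_size.
# world_size = 8 is an assumption based on num_processes above.
train_batch_size = 128
gradient_accumulation_steps = 1
world_size = 8

micro_batch_per_gpu = train_batch_size // (gradient_accumulation_steps * world_size)

# Verify the factorization is exact, as DeepSpeed would at startup.
assert micro_batch_per_gpu * gradient_accumulation_steps * world_size == train_batch_size
print(micro_batch_per_gpu)
```

With these numbers each GPU sees a micro-batch of 16; if the world size were different, the micro-batch would change accordingly.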

Running on a slurm HPC.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Install:

pip install accelerate
pip install deepspeed

Create the config. Note: the deepspeed config file can be empty, since the crash happens before it is opened. Similarly, main_process_ip and main_process_port can be anything, as they are not used before the crash.

{
  "compute_environment": "LOCAL_MACHINE",
  "deepspeed_config": {
    "deepspeed_config_file": "/path/to/deepspeed_config.json",
    "zero3_init_flag": false
  },
  "distributed_type": "DEEPSPEED",
  "fsdp_config": {},
  "machine_rank": 0,
  "main_process_ip": "0.0.0.0",
  "main_process_port": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 2,
  "num_processes": 8,
  "use_cpu": false
}

Launch accelerate. Again, since the training script is never actually executed, it can be empty:

accelerate launch --config_file /path/to/accelerate_config.json /path/to/empty_script.py

Expected behavior

When launching a multi-node DeepSpeed training script, this code https://github.com/huggingface/accelerate/blob/86ce737d7fc94f8000dbd5e13021d0411bb4204a/src/accelerate/commands/launch.py#L312-L327 appears to invoke the DeepSpeed runner, while the arguments it supplies are meant for the DeepSpeed launcher.

This means I get the following error when accelerate tries to launch deepspeed:

usage: deepspeed [-h] [-H HOSTFILE] [-i INCLUDE] [-e EXCLUDE]
                 [--num_nodes NUM_NODES] [--num_gpus NUM_GPUS]
                 [--master_port MASTER_PORT] [--master_addr MASTER_ADDR]
                 [--launcher LAUNCHER] [--launcher_args LAUNCHER_ARGS]
                 [--force_multi] [--autotuning {tune,run}]
                 user_script ...
deepspeed: error: unrecognized arguments: --no_local_rank

Since accelerate is performing the same role as the DeepSpeed runner, I would expect accelerate to call the launcher directly on each node. Instead, it appears to be calling the runner on each node.
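To make the mismatch concrete, here is a simplified sketch of how a launcher-only flag can end up on the runner's command line. This is an assumption about the code path for illustration, not the actual accelerate source:

```python
# Two distinct DeepSpeed entry points are involved:
#   - the runner: the `deepspeed` CLI, which SSHes into every node
#   - the launcher: deepspeed.launcher.launch, which spawns per-GPU workers
# The crash occurs when a launcher-only flag is handed to the runner.
runner_cmd = ["deepspeed", "--num_gpus", "4", "--num_nodes", "2"]
launcher_only_flags = ["--no_local_rank"]  # rejected by the runner's argparser

cmd = runner_cmd + launcher_only_flags + ["train.py"]
print(" ".join(cmd))  # roughly the command that fails with "unrecognized arguments"
```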

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

sgugger commented, Jul 7, 2022 (3 reactions)

Ok, dug more into it: we need to completely rework accelerate launch for DeepSpeed. So, you should use the deepspeed launcher for now while we fix it!
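Following that suggestion, the workaround invocation can be sketched roughly as below. Hostnames, the port, and all paths are placeholders, and `train.py` stands in for the actual training script; the command is assembled in Python here only to show its shape. The DeepSpeed runner reads the hostfile and handles the multi-node fan-out itself:

```python
# Hedged workaround sketch (placeholders throughout): invoke the DeepSpeed
# runner directly from the head node instead of `accelerate launch`.
hostfile_contents = "node-0 slots=8\nnode-1 slots=8\n"  # hypothetical node names

cmd = [
    "deepspeed",
    "--hostfile", "hostfile",           # file containing hostfile_contents
    "--master_addr", "node-0",          # the rank-0 node
    "--master_port", "29500",           # any free port
    "train.py",                         # your training script
    "--deepspeed_config", "/path/to/deepspeed_config.json",
]
print(" ".join(cmd))
```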

sgugger commented, Jul 7, 2022 (1 reaction)

Ah, this is different indeed! I’ll have a look.
