
Multi-node deepspeed calling the runner instead of the launcher

See original GitHub issue

System Info

- `Accelerate` version: 0.10.0
- Platform: Linux-5.10.112-108.499.amzn2.x86_64-x86_64-with-glibc2.2.5
- Python version: 3.7.10
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.12.0+cu113 (True)
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: DEEPSPEED
	- mixed_precision: no
	- use_cpu: False
	- num_processes: 8
	- machine_rank: 0
	- num_machines: 1
	- main_process_ip: None
	- main_process_port: None
	- main_training_function: main
	- deepspeed_config: {'deepspeed_config_file': '/path/to/deepspeed_config.json', 'zero3_init_flag': False}
	- fsdp_config: {}

deepspeed==0.6.5

{
    "train_batch_size": 128,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 0.00001
    },
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": true,
        "contiguous_gradients": true,
        "overlap_comm": true
    }
}
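As a quick sanity check on this config: DeepSpeed requires that `train_batch_size` factor as micro-batch per GPU × gradient accumulation steps × world size. The sketch below assumes a world size of 8 processes, taken from the `num_processes` value above; that assumption is not stated explicitly in the report.

```python
# DeepSpeed constraint: train_batch_size must equal
#   micro_batch_per_gpu * gradient_accumulation_steps * world_size.
# world_size = 8 is an assumption based on num_processes above.
train_batch_size = 128
gradient_accumulation_steps = 1
world_size = 8

micro_batch_per_gpu = train_batch_size // (gradient_accumulation_steps * world_size)

# Verify the factorization is exact, as DeepSpeed would at startup.
assert micro_batch_per_gpu * gradient_accumulation_steps * world_size == train_batch_size
print(micro_batch_per_gpu)
```

With these numbers each GPU sees a micro-batch of 16; if the world size were different, the micro-batch would change accordingly.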

Running on a slurm HPC.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Install:

pip install accelerate
pip install deepspeed

Create the config. Note: the deepspeed config file can be empty, since the crash happens before it is opened. Similarly, main_process_ip and main_process_port can be anything, as they are not used before the crash.

{
  "compute_environment": "LOCAL_MACHINE",
  "deepspeed_config": {
    "deepspeed_config_file": "/path/to/deepspeed_config.json",
    "zero3_init_flag": false
  },
  "distributed_type": "DEEPSPEED",
  "fsdp_config": {},
  "machine_rank": 0,
  "main_process_ip": "0.0.0.0",
  "main_process_port": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 2,
  "num_processes": 8,
  "use_cpu": false
}

Launch accelerate. Again, since the training script is never actually executed, it can be empty:

accelerate launch --config_file /path/to/accelerate_config.json /path/to/empty_script.py

Expected behavior

When launching a multi-node DeepSpeed training script, this code https://github.com/huggingface/accelerate/blob/86ce737d7fc94f8000dbd5e13021d0411bb4204a/src/accelerate/commands/launch.py#L312-L327 appears to invoke the DeepSpeed runner, while the arguments it supplies are meant for the DeepSpeed launcher.

This means I get the following error when accelerate tries to launch deepspeed:

usage: deepspeed [-h] [-H HOSTFILE] [-i INCLUDE] [-e EXCLUDE]
                 [--num_nodes NUM_NODES] [--num_gpus NUM_GPUS]
                 [--master_port MASTER_PORT] [--master_addr MASTER_ADDR]
                 [--launcher LAUNCHER] [--launcher_args LAUNCHER_ARGS]
                 [--force_multi] [--autotuning {tune,run}]
                 user_script ...
deepspeed: error: unrecognized arguments: --no_local_rank

Since accelerate is performing the same role as the DeepSpeed runner, I would expect accelerate to call the launcher directly on each node. Instead, it appears to be calling the runner on each node.
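To make the mismatch concrete, here is a simplified sketch of how a launcher-only flag can end up on the runner's command line. This is an assumption about the code path for illustration, not the actual accelerate source:

```python
# Two distinct DeepSpeed entry points are involved:
#   - the runner: the `deepspeed` CLI, which SSHes into every node
#   - the launcher: deepspeed.launcher.launch, which spawns per-GPU workers
# The crash occurs when a launcher-only flag is handed to the runner.
runner_cmd = ["deepspeed", "--num_gpus", "4", "--num_nodes", "2"]
launcher_only_flags = ["--no_local_rank"]  # rejected by the runner's argparser

cmd = runner_cmd + launcher_only_flags + ["train.py"]
print(" ".join(cmd))  # roughly the command that fails with "unrecognized arguments"
```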

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

sgugger commented, Jul 7, 2022 (3 reactions)

Ok, dug more into it: we need to completely rework accelerate launch for DeepSpeed. So, you should use the deepspeed launcher for now while we fix it!
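Following that suggestion, the workaround invocation can be sketched roughly as below. Hostnames, the port, and all paths are placeholders, and `train.py` stands in for the actual training script; the command is assembled in Python here only to show its shape. The DeepSpeed runner reads the hostfile and handles the multi-node fan-out itself:

```python
# Hedged workaround sketch (placeholders throughout): invoke the DeepSpeed
# runner directly from the head node instead of `accelerate launch`.
hostfile_contents = "node-0 slots=8\nnode-1 slots=8\n"  # hypothetical node names

cmd = [
    "deepspeed",
    "--hostfile", "hostfile",           # file containing hostfile_contents
    "--master_addr", "node-0",          # the rank-0 node
    "--master_port", "29500",           # any free port
    "train.py",                         # your training script
    "--deepspeed_config", "/path/to/deepspeed_config.json",
]
print(" ".join(cmd))
```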

sgugger commented, Jul 7, 2022 (1 reaction)

Ah, this is different indeed! I’ll have a look.
