Multi-node DeepSpeed: `accelerate launch` calls the runner instead of the launcher
System Info
- `Accelerate` version: 0.10.0
- Platform: Linux-5.10.112-108.499.amzn2.x86_64-x86_64-with-glibc2.2.5
- Python version: 3.7.10
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.12.0+cu113 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: no
- use_cpu: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {'deepspeed_config_file': '/path/to/deepspeed_config.json', 'zero3_init_flag': False}
- fsdp_config: {}
deepspeed==0.6.5

Contents of `/path/to/deepspeed_config.json`:

{
  "train_batch_size": 128,
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 0.00001
  },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": true
  }
}
Running on a Slurm HPC cluster.
Information
- The official example scripts
- My own modified scripts

Tasks
- One of the scripts in the examples/ folder of Accelerate, or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Install:
pip install accelerate
pip install deepspeed
Create the config:
Note: the DeepSpeed config file can be empty, as the crash happens before it is opened. Similarly, `main_process_ip` and `main_process_port` can be anything, as they are not used before the crash.
{
  "compute_environment": "LOCAL_MACHINE",
  "deepspeed_config": {
    "deepspeed_config_file": "/path/to/deepspeed_config.json",
    "zero3_init_flag": false
  },
  "distributed_type": "DEEPSPEED",
  "fsdp_config": {},
  "machine_rank": 0,
  "main_process_ip": "0.0.0.0",
  "main_process_port": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 2,
  "num_processes": 8,
  "use_cpu": false
}
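(Equivalently, the file can be generated interactively with `accelerate config`; whether your accelerate version accepts an explicit target path as below is an assumption on my part, not something from the original report.)

accelerate config --config_file /path/to/accelerate_config.json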
Launch accelerate:
Again, since the training script is never actually launched, it can be an empty file.
accelerate launch --config_file /path/to/accelerate_config.json /path/to/empty_script.py
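For context, on Slurm I start one `accelerate launch` per node. A rough, untested sketch of such a submission script; the port, the paths, and the `--machine_rank`/`--main_process_ip` overrides are my assumptions, not from the original report:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# First node in the allocation acts as the main process host.
MAIN_IP=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MAIN_IP

# One srun task per node; each task passes its own node index as the machine rank.
srun bash -c 'accelerate launch \
  --config_file /path/to/accelerate_config.json \
  --machine_rank "$SLURM_NODEID" \
  --main_process_ip "$MAIN_IP" \
  --main_process_port 29500 \
  /path/to/empty_script.py'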
Expected behavior
When launching a multi-node DeepSpeed training script, this code https://github.com/huggingface/accelerate/blob/86ce737d7fc94f8000dbd5e13021d0411bb4204a/src/accelerate/commands/launch.py#L312-L327 appears to invoke the DeepSpeed runner while passing it arguments meant for the DeepSpeed launcher.
As a result, I get the following error when accelerate tries to launch DeepSpeed:
usage: deepspeed [-h] [-H HOSTFILE] [-i INCLUDE] [-e EXCLUDE]
[--num_nodes NUM_NODES] [--num_gpus NUM_GPUS]
[--master_port MASTER_PORT] [--master_addr MASTER_ADDR]
[--launcher LAUNCHER] [--launcher_args LAUNCHER_ARGS]
[--force_multi] [--autotuning {tune,run}]
user_script ...
deepspeed: error: unrecognized arguments: --no_local_rank
Since accelerate performs the same role as the DeepSpeed runner, I would expect it to call the launcher directly on each node. Instead, it appears to be calling the runner on each node.
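For clarity, the distinction as I understand it: the runner is the `deepspeed` CLI entry point, which reads a hostfile and remotely starts workers on every node itself, while the launcher is the per-node module that only spawns that node's local GPU processes. A sketch of the two invocations; the exact arguments are illustrative, based on deepspeed 0.6.x, not taken from accelerate's code:

# Runner: run once, on one node; it SSHes to every host in the hostfile.
deepspeed --hostfile /path/to/hostfile /path/to/train.py

# Launcher: run once per node; spawns only that node's worker processes.
# (--world_info is a base64-encoded mapping of hosts to GPU slots.)
python -m deepspeed.launcher.launch \
  --node_rank 0 \
  --master_addr 10.0.0.1 --master_port 29500 \
  --world_info <base64-encoded-world-info> \
  /path/to/train.py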
Top GitHub Comments
Ok, dug more into it and we need to completely rework `accelerate launch` for `deepspeed`. So you should use the deepspeed launcher for now while we fix it!

Ah, this is different indeed! I’ll have a look.
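As a stopgap, per the maintainer's suggestion above, the DeepSpeed runner can be invoked directly instead of `accelerate launch`. A sketch, assuming passwordless SSH between the nodes, 4 GPUs per node (8 processes over 2 machines, matching the config above), and illustrative paths; the flags are those shown in the usage message earlier in this issue:

# hostfile lists every node and its GPU slots, e.g.:
#   node1 slots=4
#   node2 slots=4
deepspeed --hostfile /path/to/hostfile \
  --num_nodes 2 --num_gpus 4 \
  --master_addr node1 --master_port 29500 \
  /path/to/train.py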