
Multi-Node Training No progress on Azure

See original GitHub issue

System Info

accelerate version: 0.12.0
OS: Ubuntu 18.04
Python version: 3.8
torch version: 1.11.0
accelerate config, root node (machine_rank 0): {
  "compute_environment": "LOCAL_MACHINE",
  "deepspeed_config": {},
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "fsdp_config": {},
  "machine_rank": 0,
  "main_process_ip": "20.169.144.69",
  "main_process_port": 46585,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 2,
  "num_processes": 4,
  "rdzv_backend": "static",
  "use_cpu": false
}
Node 1 (machine_rank 1): {
  "compute_environment": "LOCAL_MACHINE",
  "deepspeed_config": {},
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "fsdp_config": {},
  "machine_rank": 1,
  "main_process_ip": "20.169.144.69",
  "main_process_port": 51731,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 2,
  "num_processes": 4,
  "rdzv_backend": "static",
  "use_cpu": false
}
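
For context, settings like the two above are what `accelerate launch` applies on each node when it spawns the training processes; with num_machines 2 and num_processes 4, each machine is expected to run two processes and rendezvous at main_process_ip:main_process_port. A minimal sketch of the kind of script being launched (the model, optimizer and data below are hypothetical placeholders, not the reporter's actual code):

import torch
from accelerate import Accelerator

def main():
    # picks up the distributed settings prepared by `accelerate launch`
    accelerator = Accelerator()

    # toy model, optimizer and data, only to show the wiring
    model = torch.nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    dataloader = torch.utils.data.DataLoader(torch.randn(64, 10), batch_size=8)

    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for batch in dataloader:
        loss = model(batch).mean()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

    accelerator.print("finished one pass")  # printed on the main process only

if __name__ == "__main__":
    main()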

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Use two Azure VMs for multi-node training.
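
Since the failure mode described below is a silent hang, a useful first check when reproducing is whether the two VMs can rendezvous at all, independently of Accelerate. A minimal sketch of such a check (the file name and structure are hypothetical; it relies on the standard MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE variables the launcher exports), started with the same launcher on both nodes:

import os
import torch
import torch.distributed as dist

def main():
    # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    # one tiny collective: if this returns on every rank, cross-node NCCL traffic works
    tensor = torch.ones(1, device="cuda")
    dist.all_reduce(tensor)
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} sees {tensor.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this also hangs, the problem sits at the networking or rendezvous level rather than inside Accelerate itself.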

Expected behavior

The expected behaviour was that both machines would start training and either communicate with each other to sync, or raise an error if that fails; instead, training did not proceed on either node. I tried NCCL_DEBUG=INFO to check for network issues but got no output. I am not sure what I could be missing here, as I followed the steps from previously closed multi-node issues such as https://github.com/huggingface/accelerate/issues/609 and https://github.com/huggingface/accelerate/issues/412.
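
One note on the NCCL_DEBUG=INFO attempt: NCCL only starts logging once a communicator is being initialized, so a job stuck in the earlier TCP rendezvous produces no NCCL output at all. A small, hypothetical diagnostic that dumps what each launched process actually sees can make it easier to compare the two nodes:

import os

# NCCL reads this at communicator-init time, so setting it in-process also works
# when exporting it in the shell before `accelerate launch` is inconvenient
os.environ.setdefault("NCCL_DEBUG", "INFO")

# the static rendezvous requires both nodes to agree on these values
for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE"):
    print(key, "=", os.environ.get(key))

For what it's worth, the two configs above list different main_process_port values (46585 vs. 51731); with the static rendezvous both machines need to point at the same address and port on the main process.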

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 13

Top GitHub Comments

1 reaction
muellerzr commented, Sep 27, 2022

Labeling this as “feature request” for now since, after some 1:1 debugging, it seems to be related to the Azure platform, and we don’t have access to Azure machines with GPUs (yet) to test this out. But we’ll get them soon.

0 reactions
vishalghor commented, Sep 19, 2022

Thanks @muellerzr. Looking forward to your feedback on it. 😃

Read more comments on GitHub >

Top Results From Across the Web

Multi-node training on 2 A100 machines. · Issue #609 - GitHub
Hi, I am trying to pretrain a wav2vec2 model on custom dataset am trying to run it on multiple Azure A100 virtual machines....

Interact with your jobs (debug and monitor) - Azure
Debug jobs and monitor training progress (preview) ... Custom distributed training setup (configuring multi-node training without using the ...

Create an Azure Machine Learning compute cluster
Azure Machine Learning compute cluster is a managed-compute infrastructure that allows you to easily create a single or multi-node compute.

Accelerating Distributed Training in Azure Machine Learning ...
We can see that across models and GPU configurations SR-IOV offers 2-3 times improvement over No SR-IOV. ...

Distributed training - Azure Databricks | Microsoft Learn
For these workloads, Databricks Runtime ML includes the Horovod and spark-tensorflow-distributor packages. Note. Databricks does not recommend ...
