
Multi-Node Training No progress on Azure

See original GitHub issue

System Info

accelerate version: 0.12.0
OS: Ubuntu 18.04
Python version: 3.8
torch version: 1.11.0
accelerate config, root node (machine_rank 0): {
  "compute_environment": "LOCAL_MACHINE",
  "deepspeed_config": {},
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "fsdp_config": {},
  "machine_rank": 0,
  "main_process_ip": "20.169.144.69",
  "main_process_port": 46585,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 2,
  "num_processes": 4,
  "rdzv_backend": "static",
  "use_cpu": false
}
Node 1 (machine_rank 1): {
  "compute_environment": "LOCAL_MACHINE",
  "deepspeed_config": {},
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "fsdp_config": {},
  "machine_rank": 1,
  "main_process_ip": "20.169.144.69",
  "main_process_port": 51731,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 2,
  "num_processes": 4,
  "rdzv_backend": "static",
  "use_cpu": false
}
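
For context, settings like the two above are what `accelerate launch` applies on each node when it spawns the training processes; with num_machines 2 and num_processes 4, each machine is expected to run two processes and rendezvous at main_process_ip:main_process_port. A minimal sketch of the kind of script being launched (the model, optimizer and data below are hypothetical placeholders, not the reporter's actual code):

import torch
from accelerate import Accelerator

def main():
    # picks up the distributed settings prepared by `accelerate launch`
    accelerator = Accelerator()

    # toy model, optimizer and data, only to show the wiring
    model = torch.nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    dataloader = torch.utils.data.DataLoader(torch.randn(64, 10), batch_size=8)

    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for batch in dataloader:
        loss = model(batch).mean()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

    accelerator.print("finished one pass")  # printed on the main process only

if __name__ == "__main__":
    main()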

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Use two Azure VMs for multi-node training.
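
Since the failure mode described below is a silent hang, a useful first check when reproducing is whether the two VMs can rendezvous at all, independently of Accelerate. A minimal sketch of such a check (the file name and structure are hypothetical; it relies on the standard MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE variables the launcher exports), started with the same launcher on both nodes:

import os
import torch
import torch.distributed as dist

def main():
    # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    # one tiny collective: if this returns on every rank, cross-node NCCL traffic works
    tensor = torch.ones(1, device="cuda")
    dist.all_reduce(tensor)
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} sees {tensor.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this also hangs, the problem sits at the networking or rendezvous level rather than inside Accelerate itself.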

Expected behavior

The expected behaviour was that both machines would start training and either communicate with each other to sync, or raise an error if that fails; instead, training did not proceed on either node. I tried NCCL_DEBUG=INFO to check for network issues but got no output. I am not sure what I could be missing here, as I followed the steps from previously closed multi-node issues such as https://github.com/huggingface/accelerate/issues/609 and https://github.com/huggingface/accelerate/issues/412.
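
One note on the NCCL_DEBUG=INFO attempt: NCCL only starts logging once a communicator is being initialized, so a job stuck in the earlier TCP rendezvous produces no NCCL output at all. A small, hypothetical diagnostic that dumps what each launched process actually sees can make it easier to compare the two nodes:

import os

# NCCL reads this at communicator-init time, so setting it in-process also works
# when exporting it in the shell before `accelerate launch` is inconvenient
os.environ.setdefault("NCCL_DEBUG", "INFO")

# the static rendezvous requires both nodes to agree on these values
for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE"):
    print(key, "=", os.environ.get(key))

For what it's worth, the two configs above list different main_process_port values (46585 vs. 51731); with the static rendezvous both machines need to point at the same address and port on the main process.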

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 13

Top GitHub Comments

1 reaction
muellerzr commented, Sep 27, 2022

Labeling this as “feature request” for now since, after some 1:1 debugging, it seems to be related to the Azure platform, and we don’t have access to Azure machines with GPUs (yet) to test this out. But we’ll get them soon.

0 reactions
vishalghor commented, Sep 19, 2022

Thanks @muellerzr. Looking forward to your feedback on it. 😃

Read more comments on GitHub >

Top Results From Across the Web

Multi-node training on 2 A100 machines. · Issue #609 - GitHub
Hi, I am trying to pretrain a wav2vec2 model on custom dataset am trying to run it on multiple Azure A100 virtual machines....

Interact with your jobs (debug and monitor) - Azure
Debug jobs and monitor training progress (preview) ... Custom distributed training setup (configuring multi-node training without using the ...

Create an Azure Machine Learning compute cluster
Azure Machine Learning compute cluster is a managed-compute infrastructure that allows you to easily create a single or multi-node compute.

Accelerating Distributed Training in Azure Machine Learning ...
We can see that across models and GPU configurations SR-IOV offers 2-3 times improvement over No SR-IOV. ...

Distributed training - Azure Databricks | Microsoft Learn
For these workloads, Databricks Runtime ML includes the Horovod and spark-tensorflow-distributor packages. Note. Databricks does not recommend ...
