Multi-node training: no progress on Azure
System Info
accelerate version: 0.12.0
OS: Ubuntu 18.04
Python version: 3.8
PyTorch version: 1.11.0
Accelerate config for the root node (machine_rank 0):
{
"compute_environment": "LOCAL_MACHINE",
"deepspeed_config": {},
"distributed_type": "MULTI_GPU",
"downcast_bf16": false,
"fsdp_config": {},
"machine_rank": 0,
"main_process_ip": "20.169.144.69",
"main_process_port": 46585,
"main_training_function": "main",
"mixed_precision": "no",
"num_machines": 2,
"num_processes": 4,
"rdzv_backend": "static",
"use_cpu": false
}
Accelerate config for node 1 (machine_rank 1):
{
"compute_environment": "LOCAL_MACHINE",
"deepspeed_config": {},
"distributed_type": "MULTI_GPU",
"downcast_bf16": false,
"fsdp_config": {},
"machine_rank": 1,
"main_process_ip": "20.169.144.69",
"main_process_port": 51731,
"main_training_function": "main",
"mixed_precision": "no",
"num_machines": 2,
"num_processes": 4,
"rdzv_backend": "static",
"use_cpu": false
}
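For reference, one way to check that the two configs above actually rendezvous is to run a tiny script through accelerate launch on both machines. Below is a minimal sketch, assuming a hypothetical check_topology.py; the filename and all names in it are illustrative, not from the original report.

# check_topology.py -- minimal rendezvous check (hypothetical helper).
# Run `accelerate launch check_topology.py` on each node with the configs above.
from accelerate import Accelerator

accelerator = Accelerator()
print(
    f"process_index={accelerator.process_index} "
    f"local_process_index={accelerator.local_process_index} "
    f"num_processes={accelerator.num_processes} "
    f"device={accelerator.device}"
)
# Blocks until all num_processes workers (across both nodes) reach this barrier;
# if it hangs here, the nodes cannot reach each other via main_process_ip/port.
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    print("rendezvous succeeded")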
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate, or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
Use two Azure VMs for multi-node training.
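A minimal reproduction along these lines might look like the sketch below, assuming a hypothetical repro_train.py launched on each VM via accelerate launch with the configs above; this is an illustrative sketch, not the reporter's actual script.

# repro_train.py -- minimal multi-node training step (hypothetical sketch).
# Launch on each VM: `accelerate launch repro_train.py` with the configs above.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = accelerator.prepare(model, optimizer)

for step in range(5):
    x = torch.randn(8, 10, device=accelerator.device)
    y = torch.randn(8, 1, device=accelerator.device)
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # gradient all-reduce happens here; hangs if nodes cannot sync
    optimizer.step()
    optimizer.zero_grad()
    accelerator.print(f"step {step}: loss={loss.item():.4f}")  # main process only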
Expected behavior
The expected behaviour was that both machines would start training and either communicate with each other to sync, or raise an error if that fails. Instead, training did not proceed on either node. I tried setting NCCL_DEBUG=INFO to check for network issues but got no output. I am not sure what I could be missing here, as I followed the steps mentioned in previously closed multi-node issues such as https://github.com/huggingface/accelerate/issues/609 and https://github.com/huggingface/accelerate/issues/412.
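To take Accelerate out of the picture when debugging a hang like this, one option is a bare torch.distributed check launched with torchrun and NCCL_DEBUG=INFO. The sketch below is an assumption: the check_dist.py name is hypothetical and the torchrun flags simply mirror the rank-0 config above.

# check_dist.py -- bare torch.distributed connectivity check (hypothetical sketch).
# Run on each node, reusing rank 0's address from the configs above, e.g.:
#   NCCL_DEBUG=INFO torchrun --nnodes=2 --nproc_per_node=2 --node_rank=<0 or 1> \
#       --master_addr=20.169.144.69 --master_port=46585 check_dist.py
# Note: with a static rendezvous, every node must use the same master_addr/master_port.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # reads MASTER_ADDR/MASTER_PORT set by torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
t = torch.ones(1, device="cuda")
dist.all_reduce(t)  # sums across all ranks; a hang here points at inter-node NCCL traffic
print(f"rank {dist.get_rank()}: all_reduce -> {t.item()} (world_size={dist.get_world_size()})")
dist.destroy_process_group()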
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Labeling this as “feature request” for now, since after some 1:1 debugging it seems to be related to the Azure platform, and we don’t have access to Azure machines with GPUs (yet) to test this out. But we’ll get them soon.
Thanks @muellerzr. Looking forward to your feedback on it. 😃