data parallel and model parallel at the same time
Describe the bug
I am working on a multi-GPU, multi-node application. I want to use model parallelism within each node and data parallelism across nodes. I followed the suggestion in https://github.com/laekov/fastmoe/issues/105, but I cannot make it work, and I am not sure what value moe_group should take on each worker. Could you please help by looking at my minimal reproduction script? It works when I only use model parallelism (group_world_size=4), but it fails when I mix model parallelism and data parallelism (group_world_size=2).
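For reference, this is the process-group layout I am aiming for with 4 GPUs and group_world_size=2. The helper below is only an illustrative sketch, not part of my actual script and not a FastMoE API; it uses plain torch.distributed and follows the documented requirement that every rank enter new_group for every group, even groups it does not belong to.

import torch

def build_moe_groups(rank, world_size, group_world_size):
    # Illustrative helper (not a FastMoE API): build the model-parallel (comm)
    # and data-parallel (sync) groups.  torch.distributed.new_group is a
    # collective, so every group is created on every rank and each rank
    # keeps only the group it belongs to.
    num_groups = world_size // group_world_size  # data-parallel degree

    moe_comm_group = None  # experts are sharded across this group
    for g in range(num_groups):
        ranks = [g * group_world_size + i for i in range(group_world_size)]
        grp = torch.distributed.new_group(ranks)
        if rank in ranks:
            moe_comm_group = grp

    moe_sync_group = None  # gradients of the local experts are all-reduced here
    for i in range(group_world_size):
        ranks = [i + g * group_world_size for g in range(num_groups)]
        grp = torch.distributed.new_group(ranks)
        if rank in ranks:
            moe_sync_group = grp

    return moe_comm_group, moe_sync_group

# With world_size=4 and group_world_size=2 this gives
#   comm groups: [0, 1] and [2, 3]  (model parallel, within a node)
#   sync groups: [0, 2] and [1, 3]  (data parallel, across nodes)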
To Reproduce
cmd:
python -m torch.distributed.launch --nproc_per_node=4 tools/test_moe_grouped_dist/mini_reproduce_group_report.py --group_world_size 4 # works
python -m torch.distributed.launch --nproc_per_node=4 tools/test_moe_grouped_dist/mini_reproduce_group_report.py --group_world_size 2 # fails
code:
import argparse

import torch
from torch.distributed import Backend

import fmoe
from fmoe import FMoETransformerMLP


def create_model(num_expert, moe_world_size, moe_group):
    # create model architecture
    model = FMoETransformerMLP(num_expert, d_model=16, d_hidden=16,
                               world_size=moe_world_size, moe_group=moe_group)
    return model


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int)
    parser.add_argument("--group_world_size", type=int)
    args = parser.parse_args()

    # if args.local_rank != 0:
    #     def print_pass(*args):
    #         pass
    #     builtins.print = print_pass

    print("distributing")
    local_rank = args.local_rank
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend=Backend.NCCL,
                                         init_method="env://")

    group_world_size = args.group_world_size
    rank = torch.distributed.get_rank()
    group_rank = rank // group_world_size
    inner_group_rank = rank % group_world_size
    group_size = torch.distributed.get_world_size() // group_world_size
    print("group_size is {}".format(group_size))

    # model-parallel (expert communication) group: ranks inside the same group
    moe_comm_group_list = [i + group_world_size * group_rank for i in range(group_world_size)]
    moe_comm_group = torch.distributed.new_group(moe_comm_group_list)
    print("rank {}, moe_comm_group list is {}".format(rank, moe_comm_group_list))
    # moe_comm_group = None

    model = create_model(num_expert=4 // group_world_size, moe_world_size=group_world_size,
                         moe_group=moe_comm_group)
    device = torch.device("cuda:{}".format(args.local_rank))
    model.to(device)
    x = torch.rand([4, 16, 5, 5]).cuda()

    # set model_moe
    # moe_sync_group = None
    # data-parallel (gradient sync) group: one rank from each model-parallel group
    moe_sync_group_list = [inner_group_rank + group_size * i for i in range(group_size)]
    print("rank {}, moe_sync_group list is {}".format(rank, moe_sync_group_list))
    moe_sync_group = torch.distributed.new_group(moe_sync_group_list)
    model = fmoe.DistributedGroupedDataParallel(model, device_ids=[local_rank], output_device=local_rank,
                                                moe_sync_group=moe_sync_group)
    model._sync_params()

    y = model(x)
    y.sum().backward()
    model.allreduce_params()

    # print("x is {}".format(x))
    print("y is {}".format(y))
    # print("model.experts.htoh4.weight.grad is {}".format(model.module.experts.htoh4.weight.grad))
log (group_world_size=2):
distributing
distributing
distributing
distributing
group_size is 2
group_size is 2
group_size is 2
group_size is 2
rank 1, moe_comm_group list is [0, 1]
rank 0, moe_comm_group list is [0, 1]
rank 3, moe_comm_group list is [2, 3]
rank 2, moe_comm_group list is [2, 3]
rank 2, moe_sync_group list is [0, 2]
rank 1, moe_sync_group list is [1, 3]
rank 0, moe_sync_group list is [0, 2]
rank 3, moe_sync_group list is [1, 3]
NCCL Error at /home/t-xiaochen/envs/fastmoe/cuda/global_exchange.cpp:121 value 2
Killing subprocess 80153
Killing subprocess 80154
Killing subprocess 80155
Killing subprocess 80156
Traceback (most recent call last):
File "/home/t-xiaochen/.conda/envs/moe/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/t-xiaochen/.conda/envs/moe/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/t-xiaochen/.conda/envs/moe/lib/python3.6/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/t-xiaochen/.conda/envs/moe/lib/python3.6/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/t-xiaochen/.conda/envs/moe/lib/python3.6/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/t-xiaochen/.conda/envs/moe/bin/python', '-u', 'tools/test_moe_grouped_dist/mini_reproduce_group_report.py', '--local_rank=3', '--group_world_size', '2']' returned non-zero exit status 255.
Top GitHub Comments
I would prefer my multi-node code to have the same gradient as when the model is trained on a single GPU, since a difference in gradient scale can force hyperparameter changes. But it's OK, because we can always divide by the factor on our side. Thank you again for your reply and for your contribution to this great project.
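A minimal sketch of the kind of rescaling mentioned above (hypothetical, not code from the issue; it assumes the all-reduced gradients differ from the single-GPU ones by a constant factor, which the follow-up comment below identifies as group_world_size):

# Hypothetical post-processing after model.allreduce_params() in the
# reproduction script above.  scale_factor is whatever constant the hybrid
# gradients are off by relative to the single-GPU run.
scale_factor = args.group_world_size
for p in model.module.parameters():
    if p.grad is not None:
        p.grad.div_(scale_factor)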
Here are my scripts. You would need to run the first one to compute the gradient on a single GPU and save the data, then run the second one to compute the gradient for the same input and model in the hybrid setting. It is worth noting that I changed the loss to
y.sum(-1).sum(-1).mean().backward()
here to mimic the common batch-reduction fashion of most losses. After dividing by the group size, the gradient matches model.module.experts.htoh4.weight.grad / group_world_size.
cmd:
python -m torch.distributed.launch --nproc_per_node=1 first.py
python -m torch.distributed.launch --nproc_per_node=4 second.py
code: