Error when running multi-process inference with NCCL
Hi,
I am trying to run inference with multiple processes and the nccl backend. I haven't tried a distributed (multi-node) run yet, because it already fails at this stage.
I am attaching the output below; I hope someone can help. The error occurs while loading the model.
My PyTorch version is 1.9.0+cu111.
run pytorch ...
[INFO] 2022-04-07 16:10:54,394 run: Running torch.distributed.run with args: ['/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/distributed/run.py', '--nproc_per_node=2', 'dlrm_s_pytorch.py', '--arch-sparse-feature-size=16', '--arch-mlp-bot=13-512-256-64-16', '--arch-mlp-top=512-256-1', '--data-generation=dataset', '--data-set=kaggle', '--raw-data-file=/tmp/dlrm_rd/train.txt', '--processed-data-file=/tmp/dlrm_rd/kaggleAdDisplayChallenge_processed.npz', '--inference-only', '--loss-function=bce', '--round-targets=True', '--load-model=/tmp/dlrm_rd/criteo-medium-100bat.pt', '--print-freq=1024', '--test-mini-batch-size=50000', '--mini-batch-size=50000', '--num-batches=50000', '--print-time', '--print-wall-time', '--num-workers=16', '--dist-backend=nccl', '--use-gpu']
[INFO] 2022-04-07 16:10:54,396 run: Using nproc_per_node=2.
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[INFO] 2022-04-07 16:10:54,396 api: Starting elastic_operator with launch configs:
entrypoint : dlrm_s_pytorch.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 2
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
[INFO] 2022-04-07 16:10:54,397 local_elastic_agent: log directory set to: /tmp/torchelastic_nvhce4xu/none_wg3c992e
[INFO] 2022-04-07 16:10:54,397 api: [default] starting workers for entrypoint: python3
[INFO] 2022-04-07 16:10:54,397 api: [default] Rendezvous'ing worker group
[INFO] 2022-04-07 16:10:54,397 static_tcp_rendezvous: Creating TCPStore as the c10d::Store implementation
/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
[INFO] 2022-04-07 16:10:54,400 api: [default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]
[INFO] 2022-04-07 16:10:54,400 api: [default] Starting worker group
[INFO] 2022-04-07 16:10:54,401 __init__: Setting worker0 reply file to: /tmp/torchelastic_nvhce4xu/none_wg3c992e/attempt_0/0/error.json
[INFO] 2022-04-07 16:10:54,401 __init__: Setting worker1 reply file to: /tmp/torchelastic_nvhce4xu/none_wg3c992e/attempt_0/1/error.json
Running on 2 ranks using nccl backend
world size: 2, current rank: 0, local rank: 0
Using 1 GPU(s)...
Reading pre-processed data=/tmp/dlrm_rd/kaggleAdDisplayChallenge_processed.npz
world size: 2, current rank: 1, local rank: 1
Sparse fea = 26, Dense fea = 13
Defined train indices...
Randomized indices across days ...
Split data according to indices...
Reading pre-processed data=/tmp/dlrm_rd/kaggleAdDisplayChallenge_processed.npz
Sparse fea = 26, Dense fea = 13
Defined test indices...
Randomized indices across days ...
Split data according to indices...
Loading saved model /tmp/dlrm_rd/criteo-medium-100bat.pt
Traceback (most recent call last):
File "/---/dlrm/dlrm_s_pytorch.py", line 1880, in <module>
run()
File "/---/dlrm/dlrm_s_pytorch.py", line 1396, in run
dlrm.load_state_dict(ld_model["state_dict"], False)
File "/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DLRM_Net:
size mismatch for emb_l.0.weight: copying a param with shape torch.Size([1460, 16]) from checkpoint, the shape in current model is torch.Size([27, 16]).
size mismatch for emb_l.1.weight: copying a param with shape torch.Size([583, 16]) from checkpoint, the shape in current model is torch.Size([14992, 16]).
size mismatch for emb_l.2.weight: copying a param with shape torch.Size([10131227, 16]) from checkpoint, the shape in current model is torch.Size([5461306, 16]).
size mismatch for emb_l.3.weight: copying a param with shape torch.Size([2202608, 16]) from checkpoint, the shape in current model is torch.Size([10, 16]).
size mismatch for emb_l.4.weight: copying a param with shape torch.Size([305, 16]) from checkpoint, the shape in current model is torch.Size([5652, 16]).
size mismatch for emb_l.5.weight: copying a param with shape torch.Size([24, 16]) from checkpoint, the shape in current model is torch.Size([2173, 16]).
size mismatch for emb_l.6.weight: copying a param with shape torch.Size([12517, 16]) from checkpoint, the shape in current model is torch.Size([4, 16]).
size mismatch for emb_l.7.weight: copying a param with shape torch.Size([633, 16]) from checkpoint, the shape in current model is torch.Size([7046547, 16]).
size mismatch for emb_l.8.weight: copying a param with shape torch.Size([3, 16]) from checkpoint, the shape in current model is torch.Size([18, 16]).
size mismatch for emb_l.9.weight: copying a param with shape torch.Size([93145, 16]) from checkpoint, the shape in current model is torch.Size([15, 16]).
size mismatch for emb_l.10.weight: copying a param with shape torch.Size([5683, 16]) from checkpoint, the shape in current model is torch.Size([286181, 16]).
size mismatch for emb_l.11.weight: copying a param with shape torch.Size([8351593, 16]) from checkpoint, the shape in current model is torch.Size([105, 16]).
size mismatch for emb_l.12.weight: copying a param with shape torch.Size([3194, 16]) from checkpoint, the shape in current model is torch.Size([142572, 16]).
Saved at: epoch = 0/1, batch = 512/512, ntbatch = 25581
Training state: loss = 0.522237
Testing state: accuracy = 76.654 %
time/loss/accuracy (if enabled):
[ERROR] 2022-04-07 16:13:11,217 api: failed (exitcode: 1) local_rank: 1 (pid: 192412) of binary: /---/.conda/envs/dlrm-mpi/bin/python3
[ERROR] 2022-04-07 16:13:11,217 local_elastic_agent: [default] Worker group failed
[INFO] 2022-04-07 16:13:11,217 api: [default] Worker group FAILED. 3/3 attempts left; will restart worker group
[INFO] 2022-04-07 16:13:11,218 api: [default] Stopping worker group
[INFO] 2022-04-07 16:13:11,218 api: [default] Rendezvous'ing worker group
[INFO] 2022-04-07 16:13:11,218 static_tcp_rendezvous: Creating TCPStore as the c10d::Store implementation
[INFO] 2022-04-07 16:13:11,220 api: [default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]
[INFO] 2022-04-07 16:13:11,220 api: [default] Starting worker group
[INFO] 2022-04-07 16:13:11,221 __init__: Setting worker0 reply file to: /tmp/torchelastic_nvhce4xu/none_wg3c992e/attempt_1/0/error.json
[INFO] 2022-04-07 16:13:11,221 __init__: Setting worker1 reply file to: /tmp/torchelastic_nvhce4xu/none_wg3c992e/attempt_1/1/error.json
Traceback (most recent call last):
File "/---/git/dlrm/dlrm_s_pytorch.py", line 1880, in <module>
run()
File "/---/git/dlrm/dlrm_s_pytorch.py", line 1064, in run
ext_dist.init_distributed(local_rank=args.local_rank, use_gpu=use_gpu, backend=args.dist_backend)
File "/---/git/dlrm/extend_distributed.py", line 160, in init_distributed
dist.init_process_group(backend, rank=rank, world_size=size)
File "/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
Traceback (most recent call last):
File "/---/git/dlrm/dlrm_s_pytorch.py", line 1880, in <module>
run()
File "/---/git/dlrm/dlrm_s_pytorch.py", line 1064, in run
ext_dist.init_distributed(local_rank=args.local_rank, use_gpu=use_gpu, backend=args.dist_backend)
File "/---/git/dlrm/extend_distributed.py", line 160, in init_distributed
dist.init_process_group(backend, rank=rank, world_size=size)
File "/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[ERROR] 2022-04-07 16:43:18,109 api: failed (exitcode: 1) local_rank: 0 (pid: 192531) of binary: /---/.conda/envs/dlrm-mpi/bin/python3
[ERROR] 2022-04-07 16:43:18,110 local_elastic_agent: [default] Worker group failed
[INFO] 2022-04-07 16:43:18,110 api: [default] Worker group FAILED. 2/3 attempts left; will restart worker group
[INFO] 2022-04-07 16:43:18,110 api: [default] Stopping worker group
[INFO] 2022-04-07 16:43:18,110 api: [default] Rendezvous'ing worker group
[INFO] 2022-04-07 16:43:18,110 static_tcp_rendezvous: Creating TCPStore as the c10d::Store implementation
[INFO] 2022-04-07 16:43:18,116 api: [default] Rendezvous complete for workers. Result:
restart_count=2
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]
[INFO] 2022-04-07 16:43:18,116 api: [default] Starting worker group
[INFO] 2022-04-07 16:43:18,117 __init__: Setting worker0 reply file to: /tmp/torchelastic_nvhce4xu/none_wg3c992e/attempt_2/0/error.json
[INFO] 2022-04-07 16:43:18,117 __init__: Setting worker1 reply file to: /tmp/torchelastic_nvhce4xu/none_wg3c992e/attempt_2/1/error.json
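To rule DLRM itself in or out, a minimal NCCL sanity check can be launched with the same torch.distributed.run launcher. This is only a sketch; the file name nccl_sanity.py and the script body are illustrative and not part of DLRM:

```python
# nccl_sanity.py -- minimal sketch (independent of DLRM) to verify that a
# 2-process NCCL group can initialize on this machine.
# Launch with: python -m torch.distributed.run --nproc_per_node=2 nccl_sanity.py
import os

import torch
import torch.distributed as dist


def main():
    # torch.distributed.run sets RANK, WORLD_SIZE and LOCAL_RANK for each worker.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # Bind each process to its own GPU before creating the NCCL process group.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # A single all-reduce exercises the NCCL communicator.
    t = torch.ones(1, device="cuda") * rank
    dist.all_reduce(t)
    print(f"rank {rank}/{world_size}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this also hangs or times out, the problem lies in the NCCL/process-group setup (GPUs, drivers, ports) rather than in dlrm_s_pytorch.py; running it with NCCL_DEBUG=INFO prints the NCCL initialization details.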
Top GitHub Comments
Hi. I'm using the command line below.
https://github.com/facebookresearch/dlrm/issues/231#issuecomment-1092113060 I think it's because you are loading the model's weights from a checkpoint (an already saved model). Maybe remove the checkpoint option (--load-model) or make sure your model config matches the checkpoint.
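A minimal sketch of how to check this (the checkpoint path and the "state_dict" key are taken from the traceback above): print the embedding-table shapes stored in the checkpoint before calling load_state_dict.

```python
# Inspect the embedding-table shapes saved in the checkpoint.
import torch

ckpt = torch.load("/tmp/dlrm_rd/criteo-medium-100bat.pt", map_location="cpu")
for name, tensor in ckpt["state_dict"].items():
    if name.startswith("emb_l."):
        print(name, tuple(tensor.shape))
```

If the printed row counts differ from the embedding sizes the script derives from the processed dataset, the checkpoint was produced with a different dataset or architecture configuration and cannot be loaded as-is.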
Hi @mnaumovfb, did you have the chance to give the distributed version a try?