Stuck on an issue?

Lightrun Answers was designed to reduce the constant Googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Error when running multiprocess inference with nccl

See original GitHub issue

Hi, I am trying to run with multiple processes and nccl as the backend. I haven't tried a fully distributed run yet because it already fails at this stage. I am attaching the output; I hope someone can help. There is an error when loading the model. My PyTorch version is 1.9.0+cu111.

run pytorch ...
[INFO] 2022-04-07 16:10:54,394 run: Running torch.distributed.run with args: ['/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/distributed/run.py', '--nproc_per_node=2', 'dlrm_s_pytorch.py', '--arch-sparse-feature-size=16', '--arch-mlp-bot=13-512-256-64-16', '--arch-mlp-top=512-256-1', '--data-generation=dataset', '--data-set=kaggle', '--raw-data-file=/tmp/dlrm_rd/train.txt', '--processed-data-file=/tmp/dlrm_rd/kaggleAdDisplayChallenge_processed.npz', '--inference-only', '--loss-function=bce', '--round-targets=True', '--load-model=/tmp/dlrm_rd/criteo-medium-100bat.pt', '--print-freq=1024', '--test-mini-batch-size=50000', '--mini-batch-size=50000', '--num-batches=50000', '--print-time', '--print-wall-time', '--num-workers=16', '--dist-backend=nccl', '--use-gpu']
[INFO] 2022-04-07 16:10:54,396 run: Using nproc_per_node=2.
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[INFO] 2022-04-07 16:10:54,396 api: Starting elastic_operator with launch configs:
  entrypoint       : dlrm_s_pytorch.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 2
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

[INFO] 2022-04-07 16:10:54,397 local_elastic_agent: log directory set to: /tmp/torchelastic_nvhce4xu/none_wg3c992e
[INFO] 2022-04-07 16:10:54,397 api: [default] starting workers for entrypoint: python3
[INFO] 2022-04-07 16:10:54,397 api: [default] Rendezvous'ing worker group
[INFO] 2022-04-07 16:10:54,397 static_tcp_rendezvous: Creating TCPStore as the c10d::Store implementation
/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
[INFO] 2022-04-07 16:10:54,400 api: [default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]

[INFO] 2022-04-07 16:10:54,400 api: [default] Starting worker group
[INFO] 2022-04-07 16:10:54,401 __init__: Setting worker0 reply file to: /tmp/torchelastic_nvhce4xu/none_wg3c992e/attempt_0/0/error.json
[INFO] 2022-04-07 16:10:54,401 __init__: Setting worker1 reply file to: /tmp/torchelastic_nvhce4xu/none_wg3c992e/attempt_0/1/error.json
Running on 2 ranks using nccl backend
world size: 2, current rank: 0, local rank: 0
Using 1 GPU(s)...
Reading pre-processed data=/tmp/dlrm_rd/kaggleAdDisplayChallenge_processed.npz
world size: 2, current rank: 1, local rank: 1
Sparse fea = 26, Dense fea = 13
Defined train indices...
Randomized indices across days ...
Split data according to indices...
Reading pre-processed data=/tmp/dlrm_rd/kaggleAdDisplayChallenge_processed.npz
Sparse fea = 26, Dense fea = 13
Defined test indices...
Randomized indices across days ...
Split data according to indices...
Loading saved model /tmp/dlrm_rd/criteo-medium-100bat.pt
Traceback (most recent call last):
  File "/---/dlrm/dlrm_s_pytorch.py", line 1880, in <module>
    run()
  File "/---/dlrm/dlrm_s_pytorch.py", line 1396, in run
    dlrm.load_state_dict(ld_model["state_dict"], False)
  File "/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DLRM_Net:
	size mismatch for emb_l.0.weight: copying a param with shape torch.Size([1460, 16]) from checkpoint, the shape in current model is torch.Size([27, 16]).
	size mismatch for emb_l.1.weight: copying a param with shape torch.Size([583, 16]) from checkpoint, the shape in current model is torch.Size([14992, 16]).
	size mismatch for emb_l.2.weight: copying a param with shape torch.Size([10131227, 16]) from checkpoint, the shape in current model is torch.Size([5461306, 16]).
	size mismatch for emb_l.3.weight: copying a param with shape torch.Size([2202608, 16]) from checkpoint, the shape in current model is torch.Size([10, 16]).
	size mismatch for emb_l.4.weight: copying a param with shape torch.Size([305, 16]) from checkpoint, the shape in current model is torch.Size([5652, 16]).
	size mismatch for emb_l.5.weight: copying a param with shape torch.Size([24, 16]) from checkpoint, the shape in current model is torch.Size([2173, 16]).
	size mismatch for emb_l.6.weight: copying a param with shape torch.Size([12517, 16]) from checkpoint, the shape in current model is torch.Size([4, 16]).
	size mismatch for emb_l.7.weight: copying a param with shape torch.Size([633, 16]) from checkpoint, the shape in current model is torch.Size([7046547, 16]).
	size mismatch for emb_l.8.weight: copying a param with shape torch.Size([3, 16]) from checkpoint, the shape in current model is torch.Size([18, 16]).
	size mismatch for emb_l.9.weight: copying a param with shape torch.Size([93145, 16]) from checkpoint, the shape in current model is torch.Size([15, 16]).
	size mismatch for emb_l.10.weight: copying a param with shape torch.Size([5683, 16]) from checkpoint, the shape in current model is torch.Size([286181, 16]).
	size mismatch for emb_l.11.weight: copying a param with shape torch.Size([8351593, 16]) from checkpoint, the shape in current model is torch.Size([105, 16]).
	size mismatch for emb_l.12.weight: copying a param with shape torch.Size([3194, 16]) from checkpoint, the shape in current model is torch.Size([142572, 16]).
Saved at: epoch = 0/1, batch = 512/512, ntbatch = 25581
Training state: loss = 0.522237
Testing state: accuracy = 76.654 %
time/loss/accuracy (if enabled):
[ERROR] 2022-04-07 16:13:11,217 api: failed (exitcode: 1) local_rank: 1 (pid: 192412) of binary: /---/.conda/envs/dlrm-mpi/bin/python3
[ERROR] 2022-04-07 16:13:11,217 local_elastic_agent: [default] Worker group failed
[INFO] 2022-04-07 16:13:11,217 api: [default] Worker group FAILED. 3/3 attempts left; will restart worker group
[INFO] 2022-04-07 16:13:11,218 api: [default] Stopping worker group
[INFO] 2022-04-07 16:13:11,218 api: [default] Rendezvous'ing worker group
[INFO] 2022-04-07 16:13:11,218 static_tcp_rendezvous: Creating TCPStore as the c10d::Store implementation
[INFO] 2022-04-07 16:13:11,220 api: [default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]

[INFO] 2022-04-07 16:13:11,220 api: [default] Starting worker group
[INFO] 2022-04-07 16:13:11,221 __init__: Setting worker0 reply file to: /tmp/torchelastic_nvhce4xu/none_wg3c992e/attempt_1/0/error.json
[INFO] 2022-04-07 16:13:11,221 __init__: Setting worker1 reply file to: /tmp/torchelastic_nvhce4xu/none_wg3c992e/attempt_1/1/error.json
Traceback (most recent call last):
  File "/---/git/dlrm/dlrm_s_pytorch.py", line 1880, in <module>
    run()
  File "/---/git/dlrm/dlrm_s_pytorch.py", line 1064, in run
    ext_dist.init_distributed(local_rank=args.local_rank, use_gpu=use_gpu, backend=args.dist_backend)
  File "/---/git/dlrm/extend_distributed.py", line 160, in init_distributed
    dist.init_process_group(backend, rank=rank, world_size=size)
  File "/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
Traceback (most recent call last):
  File "/---/git/dlrm/dlrm_s_pytorch.py", line 1880, in <module>
    run()
  File "/---/git/dlrm/dlrm_s_pytorch.py", line 1064, in run
    ext_dist.init_distributed(local_rank=args.local_rank, use_gpu=use_gpu, backend=args.dist_backend)
  File "/---/git/dlrm/extend_distributed.py", line 160, in init_distributed
    dist.init_process_group(backend, rank=rank, world_size=size)
  File "/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/---/.conda/envs/dlrm-mpi/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[ERROR] 2022-04-07 16:43:18,109 api: failed (exitcode: 1) local_rank: 0 (pid: 192531) of binary: /---/.conda/envs/dlrm-mpi/bin/python3
[ERROR] 2022-04-07 16:43:18,110 local_elastic_agent: [default] Worker group failed
[INFO] 2022-04-07 16:43:18,110 api: [default] Worker group FAILED. 2/3 attempts left; will restart worker group
[INFO] 2022-04-07 16:43:18,110 api: [default] Stopping worker group
[INFO] 2022-04-07 16:43:18,110 api: [default] Rendezvous'ing worker group
[INFO] 2022-04-07 16:43:18,110 static_tcp_rendezvous: Creating TCPStore as the c10d::Store implementation
[INFO] 2022-04-07 16:43:18,116 api: [default] Rendezvous complete for workers. Result:
  restart_count=2
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]

[INFO] 2022-04-07 16:43:18,116 api: [default] Starting worker group
[INFO] 2022-04-07 16:43:18,117 __init__: Setting worker0 reply file to: /tmp/torchelastic_nvhce4xu/none_wg3c992e/attempt_2/0/error.json
[INFO] 2022-04-07 16:43:18,117 __init__: Setting worker1 reply file to: /tmp/torchelastic_nvhce4xu/none_wg3c992e/attempt_2/1/error.json

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
SeungsuBaek commented, May 5, 2022

Hi. I'm using the command line below.

python -m torch.distributed.run --nproc_per_node=2 dlrm_s_pytorch.py --arch-mlp-bot=128-64-32 --arch-mlp-top=256-64-1 \
--arch-embedding-size=4000000-4000000 --arch-sparse-feature-size=32 --num-indices-per-lookup-fixed=true \
--num-indices-per-lookup=160 --num-batches=100 --mini-batch-size=1024 --arch-interaction-op='cat' \
--print-time --loss-function=bce --round-targets=True --learning-rate=0.1 --memory-map \
--data-generation="random" --dist-backend=nccl --use-gpu

https://github.com/facebookresearch/dlrm/issues/231#issuecomment-1092113060 I think it's because you are loading the model's weights from a checkpoint (an already saved model). Try removing the checkpoint option (--load-model), or make your model configuration match the checkpoint.
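As a starting point for making the configuration match, here is a minimal sketch (a hypothetical standalone snippet, not part of dlrm_s_pytorch.py) that prints the embedding-table shapes stored in the checkpoint so they can be compared against the embedding sizes the current run builds. Note that the second argument in dlrm.load_state_dict(ld_model["state_dict"], False) is strict=False, which only ignores missing or unexpected keys; shape mismatches still raise, which is why the error above appears despite it.

# Hypothetical standalone snippet: inspect the embedding-table shapes saved in the
# checkpoint so the dataset / embedding-size configuration of the current model can
# be made to match before loading.
import torch

ld_model = torch.load("/tmp/dlrm_rd/criteo-medium-100bat.pt", map_location="cpu")
for name, tensor in ld_model["state_dict"].items():
    if name.startswith("emb_l."):
        # The row count (first dimension) must equal the embedding-table size the
        # current model was constructed with, e.g. emb_l.0.weight -> (1460, 16).
        print(name, tuple(tensor.shape))

If the shapes printed here differ from the ones in the current model (as they do in the error above), the checkpoint was saved with a different dataset/preprocessing or embedding-size configuration, and either that configuration or the checkpoint needs to change.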

1 reaction
cmisale commented, Apr 22, 2022

Hi @mnaumovfb, did you have the chance to give the distributed version a try?

Read more comments on GitHub >

Top Results From Across the Web

NCCL error using DDP and PyTorch 1.7 · Issue #4420 - GitHub
The first thing to do whenever an NCCL error happens, as suggested by the NCCL troubleshooting page, is to run again with NCCL_DEBUG=WARN...
Read more >
unhandled system error, NCCL version 2.4.8" - Stack Overflow
I have a similar error but with RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729096246/work/torch/lib/c10d/ProcessGroupNCCL.cpp ...
Read more >
Troubleshooting - Horovod documentation - Read the Docs
If you see the error message below, it means NCCL 2 was not found in the standard libraries location. If you have a...
Read more >
Fast Multi-GPU collectives with NCCL | NVIDIA Technical Blog
The goal of NCCL is to deliver topology-aware collectives that can improve the scalability of your multi-GPU applications. By using NCCL you ...
Read more >
Distributed communication package - torch.distributed - PyTorch
Debugging - in case of NCCL failure, you can set NCCL_DEBUG=INFO to print an ... If using multiple processes per machine with nccl...
Read more >
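Following the suggestion in the linked GitHub issue and the torch.distributed documentation above, here is a minimal sketch of enabling NCCL debug output from inside the script before the process group is created; the environment variable can just as well be set on the launch command line, and the values shown are standard NCCL settings, not something specific to this repository.

# Minimal sketch: turn on NCCL debug logging before the process group is created,
# so failures like the ones in the log above report more detail.
import os
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "WARN")  # or "INFO" for full detail

# The script then initializes the process group as usual, e.g.:
# dist.init_process_group(backend="nccl", rank=rank, world_size=size)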
