[BUG] Deadlock when 'Downloading parameters' takes too much time
Describe the bug
While running the hivemind ALBERT experiment, we have one monitor peer and two worker peers. One of the worker peers is working fine, but the other peer is stuck at downloading parameters from a peer. Its log is:
[2021/11/01 07:21:50.962][INFO][averaging.averager._load_state_from_peers:577] Downloading parameters from peer QmYQsw4kqPujvWv52sFsCorZs69LNhkxAsgBhBwGbaFfez
[2021/11/01 07:28:17.871][INFO][averaging.averager._load_state_from_peers:597] Finished downloading state from QmYQsw4kqPujvWv52sFsCorZs69LNhkxAsgBhBwGbaFfez
/opt/conda/lib/python3.9/site-packages/transformers/trainer.py:1347: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
nn.utils.clip_grad_norm_(
[2021/11/01 07:28:18.759][INFO][__main__.on_step_end:153] Step 0
[2021/11/01 07:28:18.760][INFO][__main__.on_step_end:154] Your current contribution: 0 samples
[2021/11/01 07:28:18.760][INFO][__main__.on_step_end:155] Performance: 0.002546124167199564 samples per second.
[2021/11/01 07:28:18.760][INFO][__main__.on_step_end:157] Local loss: 11.4107
[2021/11/01 07:28:18.986][INFO][optim.collaborative.fetch_collaboration_state:442] Collaboration accumulated 81 samples from 1 peers; ETA 36.99 seconds (refresh in 9.25s.)
[2021/11/01 07:28:19.004][INFO][optim.collaborative.step:208] Peer is out of sync.
[2021/11/01 07:28:20.243][INFO][averaging.averager._load_state_from_peers:577] Downloading parameters from peer QmXpVXnAY6L7WqeW4pzstGK18S1LySDonPmrxQka3GztJa
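The timing above shows the cycle this peer is stuck in: the state download takes roughly 6.5 minutes (07:21:50 to 07:28:17), while the rest of the collaboration only needs about 37 seconds to reach its next step, so the freshly downloaded state is already stale and the peer immediately starts downloading again. A minimal sketch of that condition, using only numbers from this log (the function name is illustrative, not part of the hivemind API):

# Plain illustration of the re-download loop; numbers are taken from the log above.
download_seconds = 6 * 60 + 27        # 07:21:50 -> 07:28:17
collaboration_eta_seconds = 36.99     # "ETA 36.99 seconds" reported right after the download finished

def can_stay_in_sync(download_s: float, eta_s: float) -> bool:
    # A slow peer only escapes the loop if it fetches the collaboration state
    # faster than the collaboration reaches its next step.
    return download_s < eta_s

print(can_stay_in_sync(download_seconds, collaboration_eta_seconds))  # False -> "Peer is out of sync", download restarts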
To Reproduce
The monitor running script: python run_training_monitor.py --host_maddrs '/ip4/0.0.0.0/tcp/38888' --experiment_prefix albert --wandb_project albert
The worker peer script: python run_trainer.py --experiment_prefix albert --host_maddrs '/ip4/0.0.0.0/tcp/39997' --initial_peers [INITIAL_PEERS_FROM_MONITOR] --seed 42 --logging_first_step --logging_steps 100 --output_dir /train --overwrite_output_dir --logging_dir /train --target_batch_size 1024 --averaging_expiration 10 --per_device_train_batch_size 1 --gradient_accumulation_steps 1
Environment
I was running this experiment in a Docker container.
- Python version: 3.9.7
- hivemind version: 0.10.0
- Output from the PyTorch environment collection script:
Collecting environment information...
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-37-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 3090
Nvidia driver version: 460.91.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.3
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.10.0
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.0
[pip3] torchvision==0.11.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py39h7f8727e_0
[conda] mkl_fft 1.3.1 py39hd3c417c_0
[conda] mkl_random 1.2.2 py39h51133e4_0
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] numpy 1.21.3 pypi_0 pypi
[conda] numpy-base 1.21.2 py39h79a1101_0
[conda] pytorch 1.10.0 py3.9_cuda11.1_cudnn8.0.5_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch 1.10.0 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torchaudio 0.10.0 pypi_0 pypi
[conda] torchvision 0.11.1 pypi_0 pypi
Given the slow file transfer, I tested the bandwidth with iperf:
------------------------------------------------------------
Client connecting to 10.8.0.4, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.8.0.5 port 39674 connected with 10.8.0.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.4 sec 21.0 MBytes 16.9 Mbits/sec
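At the measured 16.9 Mbit/s, the ~6.5-minute download time implies a fairly large state transfer. A rough back-of-envelope check, assuming the link was close to saturated for the whole download (the exact payload size is not in this report):

# Rough sanity check using only the numbers in this report.
bandwidth_mbit_s = 16.9               # iperf result above
download_seconds = 6 * 60 + 27        # 07:21:50 -> 07:28:17 from the log

transferred_mb = bandwidth_mbit_s / 8 * download_seconds   # approximate megabytes if the link was saturated
print(f"~{transferred_mb:.0f} MB of state transferred")     # prints ~818 MB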
Top GitHub Comments
I tried the code on master and it was fine.
Closing as inactive. Feel free to open a new issue if the problem persists.