[BUG] Deadlock when 'Downloading parameters' takes too much time
Describe the bug
While running the hivemind ALBERT experiment, we have one monitor peer and two worker peers. One of the worker peers is working fine, but the other peer is stuck at downloading parameters from a peer. Its log is:
[2021/11/01 07:21:50.962][INFO][averaging.averager._load_state_from_peers:577] Downloading parameters from peer QmYQsw4kqPujvWv52sFsCorZs69LNhkxAsgBhBwGbaFfez
[2021/11/01 07:28:17.871][INFO][averaging.averager._load_state_from_peers:597] Finished downloading state from QmYQsw4kqPujvWv52sFsCorZs69LNhkxAsgBhBwGbaFfez
/opt/conda/lib/python3.9/site-packages/transformers/trainer.py:1347: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
nn.utils.clip_grad_norm_(
[2021/11/01 07:28:18.759][INFO][__main__.on_step_end:153] Step 0
[2021/11/01 07:28:18.760][INFO][__main__.on_step_end:154] Your current contribution: 0 samples
[2021/11/01 07:28:18.760][INFO][__main__.on_step_end:155] Performance: 0.002546124167199564 samples per second.
[2021/11/01 07:28:18.760][INFO][__main__.on_step_end:157] Local loss: 11.4107
[2021/11/01 07:28:18.986][INFO][optim.collaborative.fetch_collaboration_state:442] Collaboration accumulated 81 samples from 1 peers; ETA 36.99 seconds (refresh in 9.25s.)
[2021/11/01 07:28:19.004][INFO][optim.collaborative.step:208] Peer is out of sync.
[2021/11/01 07:28:20.243][INFO][averaging.averager._load_state_from_peers:577] Downloading parameters from peer QmXpVXnAY6L7WqeW4pzstGK18S1LySDonPmrxQka3GztJa
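The timing above shows the cycle this peer is stuck in: the state download takes roughly 6.5 minutes (07:21:50 to 07:28:17), while the rest of the collaboration only needs about 37 seconds to reach its next step, so the freshly downloaded state is already stale and the peer immediately starts downloading again. A minimal sketch of that condition, using only numbers from this log (the function name is illustrative, not part of the hivemind API):

# Plain illustration of the re-download loop; numbers are taken from the log above.
download_seconds = 6 * 60 + 27        # 07:21:50 -> 07:28:17
collaboration_eta_seconds = 36.99     # "ETA 36.99 seconds" reported right after the download finished

def can_stay_in_sync(download_s: float, eta_s: float) -> bool:
    # A slow peer only escapes the loop if it fetches the collaboration state
    # faster than the collaboration reaches its next step.
    return download_s < eta_s

print(can_stay_in_sync(download_seconds, collaboration_eta_seconds))  # False -> "Peer is out of sync", download restarts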
To Reproduce
The monitor running script: python run_training_monitor.py --host_maddrs '/ip4/0.0.0.0/tcp/38888' --experiment_prefix albert --wandb_project albert
The worker peer script: python run_trainer.py --experiment_prefix albert --host_maddrs '/ip4/0.0.0.0/tcp/39997' --initial_peers [INITIAL_PEERS_FROM_MONITOR] --seed 42 --logging_first_step --logging_steps 100 --output_dir /train --overwrite_output_dir --logging_dir /train --target_batch_size 1024 --averaging_expiration 10 --per_device_train_batch_size 1 --gradient_accumulation_steps 1
Environment
I was running this experiment in a Docker container.
- Python version: 3.9.7
- hivemind version: 0.10.0
- Output from the PyTorch environment collection script:
Collecting environment information...
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-37-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 3090
Nvidia driver version: 460.91.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.3
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.10.0
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.0
[pip3] torchvision==0.11.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py39h7f8727e_0
[conda] mkl_fft 1.3.1 py39hd3c417c_0
[conda] mkl_random 1.2.2 py39h51133e4_0
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] numpy 1.21.3 pypi_0 pypi
[conda] numpy-base 1.21.2 py39h79a1101_0
[conda] pytorch 1.10.0 py3.9_cuda11.1_cudnn8.0.5_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch 1.10.0 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torchaudio 0.10.0 pypi_0 pypi
[conda] torchvision 0.11.1 pypi_0 pypi
Given the slow file transfer, I tested the bandwidth with iperf:
------------------------------------------------------------
Client connecting to 10.8.0.4, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.8.0.5 port 39674 connected with 10.8.0.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.4 sec 21.0 MBytes 16.9 Mbits/sec
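At the measured 16.9 Mbit/s, the ~6.5-minute download time implies a fairly large state transfer. A rough back-of-envelope check, assuming the link was close to saturated for the whole download (the exact payload size is not in this report):

# Rough sanity check using only the numbers in this report.
bandwidth_mbit_s = 16.9               # iperf result above
download_seconds = 6 * 60 + 27        # 07:21:50 -> 07:28:17 from the log

transferred_mb = bandwidth_mbit_s / 8 * download_seconds   # approximate megabytes if the link was saturated
print(f"~{transferred_mb:.0f} MB of state transferred")     # prints ~818 MB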
Top GitHub Comments
I tried the code on master and it was fine.
Closing as inactive. Feel free to open a new issue if the problem persists.