
[BUG] Deadlock when 'Downloading parameters' takes too much time

See original GitHub issue

Describe the bug

While running the hivemind ALBERT experiment, we have one monitor peer and two worker peers. One of the worker nodes is training fine, but the other peer is stuck downloading parameters from a peer. The stuck peer's log is:

[2021/11/01 07:21:50.962][INFO][averaging.averager._load_state_from_peers:577] Downloading parameters from peer QmYQsw4kqPujvWv52sFsCorZs69LNhkxAsgBhBwGbaFfez
[2021/11/01 07:28:17.871][INFO][averaging.averager._load_state_from_peers:597] Finished downloading state from QmYQsw4kqPujvWv52sFsCorZs69LNhkxAsgBhBwGbaFfez
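
For scale, the two log lines above imply that the state download alone took about six and a half minutes. A quick check (plain Python, timestamps copied verbatim from the log):

from datetime import datetime

# Timestamps copied from the two log lines above.
start = datetime.strptime("2021/11/01 07:21:50.962", "%Y/%m/%d %H:%M:%S.%f")
end = datetime.strptime("2021/11/01 07:28:17.871", "%Y/%m/%d %H:%M:%S.%f")

print(f"State download took {(end - start).total_seconds():.1f} s")  # -> 386.9 s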


/opt/conda/lib/python3.9/site-packages/transformers/trainer.py:1347: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.

  nn.utils.clip_grad_norm_(

[2021/11/01 07:28:18.759][INFO][__main__.on_step_end:153] Step 0

[2021/11/01 07:28:18.760][INFO][__main__.on_step_end:154] Your current contribution: 0 samples

[2021/11/01 07:28:18.760][INFO][__main__.on_step_end:155] Performance: 0.002546124167199564 samples per second.

[2021/11/01 07:28:18.760][INFO][__main__.on_step_end:157] Local loss: 11.4107

[2021/11/01 07:28:18.986][INFO][optim.collaborative.fetch_collaboration_state:442] Collaboration accumulated 81 samples from 1 peers; ETA 36.99 seconds (refresh in 9.25s.)

[2021/11/01 07:28:19.004][INFO][optim.collaborative.step:208] Peer is out of sync.

[2021/11/01 07:28:20.243][INFO][averaging.averager._load_state_from_peers:577] Downloading parameters from peer QmXpVXnAY6L7WqeW4pzstGK18S1LySDonPmrxQka3GztJa
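
This log pattern points at the failure mode: the collaboration completes a round roughly every ~37 seconds (the ETA above), while a single state download takes ~387 seconds, so by the time the download finishes the received state is already many rounds stale, the peer is flagged as out of sync, and it starts downloading all over again. A minimal sketch of that feedback loop (plain Python, illustrative only, not hivemind internals; both constants come from the log):

# Illustrative only -- models the retry loop seen in the log, not hivemind's code.
ROUND_SECONDS = 37.0      # approximate collaboration round length (ETA above)
DOWNLOAD_SECONDS = 386.9  # measured duration of one full state download

collab_step = 0
for attempt in range(1, 4):
    snapshot_step = collab_step  # state being downloaded comes from this step
    collab_step += round(DOWNLOAD_SECONDS / ROUND_SECONDS)  # rounds that pass meanwhile
    lag = collab_step - snapshot_step
    print(f"attempt {attempt}: state is {lag} rounds stale -> 'Peer is out of sync', retry")

As long as one download takes longer than one round, the lag never shrinks and the peer loops forever, which looks exactly like a deadlock from the outside.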

To Reproduce

The monitor running script:

python run_training_monitor.py --host_maddrs '/ip4/0.0.0.0/tcp/38888' --experiment_prefix albert --wandb_project albert

The worker peer script:

python run_trainer.py --experiment_prefix albert --host_maddrs '/ip4/0.0.0.0/tcp/39997' --initial_peers [INITIAL_PEERS_FROM_MONITOR] --seed 42 --logging_first_step --logging_steps 100 --output_dir /train --overwrite_output_dir --logging_dir /train --target_batch_size 1024 --averaging_expiration 10 --per_device_train_batch_size 1 --gradient_accumulation_steps 1

Environment

I was running this experiment in a Docker container.

  • Python version: 3.9.7
  • hivemind version: 0.10.0
  • Output of the PyTorch environment collection script:
Collecting environment information...
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9.7 (default, Sep 16 2021, 13:09:58)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-37-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 3090
Nvidia driver version: 460.91.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.3
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.10.0
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.0
[pip3] torchvision==0.11.1
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.3.0           h06a4308_520  
[conda] mkl-service               2.4.0            py39h7f8727e_0  
[conda] mkl_fft                   1.3.1            py39hd3c417c_0  
[conda] mkl_random                1.2.2            py39h51133e4_0  
[conda] mypy-extensions           0.4.3                    pypi_0    pypi
[conda] numpy                     1.21.3                   pypi_0    pypi
[conda] numpy-base                1.21.2           py39h79a1101_0  
[conda] pytorch                   1.10.0          py3.9_cuda11.1_cudnn8.0.5_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] torch                     1.10.0                   pypi_0    pypi
[conda] torch-optimizer           0.3.0                    pypi_0    pypi
[conda] torchaudio                0.10.0                   pypi_0    pypi
[conda] torchvision               0.11.1                   pypi_0    pypi

Given how slow the file transfer was, I tested the bandwidth between the two peers with iperf:

------------------------------------------------------------
Client connecting to 10.8.0.4, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.8.0.5 port 39674 connected with 10.8.0.4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.4 sec  21.0 MBytes  16.9 Mbits/sec
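
Assuming the state transfer roughly saturated this link, the numbers are consistent with a slow but real download rather than a hang. A back-of-envelope estimate (plain Python; the actual state size is unknown and merely inferred here):

# Back-of-envelope: data that fits through the link during the 386.9 s download.
bandwidth_mbit_per_s = 16.9   # measured with iperf above
download_seconds = 386.9      # measured from the log timestamps

megabytes = bandwidth_mbit_per_s / 8 * download_seconds
print(f"~{megabytes:.0f} MB transferred")  # roughly 800 MB over ~6.5 minutes

At ~2 MB/s, even a few hundred megabytes of model and optimizer state take minutes to fetch, which is long enough for the collaboration to move on and declare the peer out of sync.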

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
finger92 commented, Nov 4, 2021

I tried the code on master and it was fine.

0 reactions
justheuristic commented, Dec 21, 2021

Closing as inactive [feel free to open a new issue if the problem persists]

