
RuntimeError: No rendezvous handler for tcp://

See original GitHub issue

Hi @yumeng5 😊

Exciting work! 👍 I’m trying to run the agnews.sh script you provided and I get the error below. I only have one GPU. Could that be the cause of this error?

Environment info

  • transformers version: 3.4.0
  • Platform: Windows 10
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.7
  • GPU: 1080 Ti x 1
  • Using distributed or parallel set-up in script?: yes

agnews.sh

“”" export CUDA_DEVICE_ORDER=PCI_BUS_ID export CUDA_VISIBLE_DEVICES=0

DATASET=agnews LABEL_NAME_FILE=label_names.txt TRAIN_CORPUS=train.txt TEST_CORPUS=test.txt TEST_LABEL=test_labels.txt MAX_LEN=200 TRAIN_BATCH=32 ACCUM_STEP=4 EVAL_BATCH=128 GPUS=1 MCP_EPOCH=3 SELF_TRAIN_EPOCH=1 … “”"

Problem

“”" Administrator@it-202007061711 MINGW64 /e/PycharmProjects/CCF/reference/LOTClass-master $ sh agnews.sh Namespace(accum_steps=4, category_vocab_size=100, dataset_dir=‘datasets/agnews/’, dist_port=12345, early_stop=False, eval_batch_size=128, final_model=‘final_model.pt’, gpus=1, label_names_file=‘label_names.txt’, match_threshold=20, max_l en=200, mcp_epochs=3, out_file=‘out.txt’, self_train_epochs=1.0, test_file=‘test.txt’, test_label_file=‘test_labels.txt’, top_pred_num=50, train_batch_size=32, train_file=‘train.txt’, update_interval=50) Effective training batch size: 128 Label names used for each class are: {0: [‘politics’], 1: [‘sports’], 2: [‘business’], 3: [‘technology’]} Some weights of the model checkpoint at bert-base-uncased were not used when initializing LOTClassModel: [‘bert.pooler.dense.weight’, ‘bert.pooler.dense.bias’, ‘cls.seq_relationship.weight’, ‘cls.seq_relationship.bias’]

  • This IS expected if you are initializing LOTClassModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing LOTClassModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of LOTClassModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: [‘cls.predictions.decoder.bias’, ‘dense.weight’, ‘dense.bias’, ‘classifier.weight’, ‘classifier.bias’] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. Loading encoded texts from datasets/agnews/train.pt Loading texts with label names from datasets/agnews/label_name_data.pt Loading encoded texts from datasets/agnews/test.pt Contructing category vocabulary. Traceback (most recent call last): File “src/train.py”, line 69, in <module> main() File “src/train.py”, line 56, in main trainer.category_vocabulary(top_pred_num=args.top_pred_num, category_vocab_size=args.category_vocab_size) File “E:\PycharmProjects\CCF\reference\LOTClass-master\src\trainer.py”, line 296, in category_vocabulary mp.spawn(self.category_vocabulary_dist, nprocs=self.world_size, args=(top_pred_num, loader_name)) File “D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py”, line 199, in spawn return start_processes(fn, args, nprocs, join, daemon, start_method=‘spawn’) File “D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py”, line 157, in start_processes while not context.join(): File “D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py”, line 118, in join raise Exception(msg) Exception:

– Process 0 terminated with the following error: Traceback (most recent call last): File “D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py”, line 19, in _wrap fn(i, *args) File “E:\PycharmProjects\CCF\reference\LOTClass-master\src\trainer.py”, line 260, in category_vocabulary_dist model = self.set_up_dist(rank) File “E:\PycharmProjects\CCF\reference\LOTClass-master\src\trainer.py”, line 67, in set_up_dist rank=rank File “D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\distributed\distributed_c10d.py”, line 421, in init_process_group init_method, rank, world_size, timeout=timeout File “D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\distributed\rendezvous.py”, line 82, in rendezvous raise RuntimeError(“No rendezvous handler for {}😕/”.format(result.scheme)) RuntimeError: No rendezvous handler for tcp:// “”" Looking forward to your reply, thankyou.
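For context: the traceback ends inside torch.distributed's rendezvous code, which means init_process_group() could not find a handler for the tcp:// init method. On Windows, PyTorch 1.7's distributed support is limited to the Gloo backend with a file:// (shared-file) init method, so the tcp:// rendezvous that LOTClass's set_up_dist relies on is never registered there. The sketch below shows an init call that fits those Windows constraints; the backend choice and shared-file path are assumptions for illustration, not code from the repository, and the maintainer's suggested route of removing the distributed calls entirely is covered in the comments further down.

import torch.distributed as dist

# Hypothetical single-process setup for Windows + PyTorch 1.7:
# only the Gloo backend with a file:// init method is supported there,
# so tcp:// (and the NCCL backend) cannot be used.
dist.init_process_group(
    backend="gloo",
    init_method="file:///E:/tmp/lotclass_shared_init",  # any path every process can reach
    rank=0,
    world_size=1,
)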

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

4 reactions
yumeng5 commented, Nov 30, 2020

Hi @JeremySun1224,

You can simply remove the .module part (i.e. torch.save(model.state_dict(), loader_file)); .module is only used when the model is a parallel-trained one.
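For illustration, roughly what that edit looks like; the function wrapper and variable names below are placeholders, not lines quoted from trainer.py:

import torch

def save_checkpoint(model, loader_file):
    # With DistributedDataParallel the weights live under model.module, so the
    # original call was torch.save(model.module.state_dict(), loader_file).
    # Without the DDP wrapper (single GPU, no distributed training), save the
    # model's own state dict directly:
    torch.save(model.state_dict(), loader_file)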

Best, Yu

2 reactions
yumeng5 commented, Nov 28, 2020

Hi @JeremySun1224,

I forgot to mention that you need to change this part too: dist.all_gather is called whenever we want to gather tensors across multiple GPUs (e.g. for computing the loss and test accuracy). If you are not using distributed training, this part should be removed. Specifically, lines 434-436 should be deleted because we don’t need to gather the training loss tensor. Similarly, the other dist.all_gather calls should be removed as well. Of course, you will want to take care of the device/names of the variables when removing those lines (e.g. when you remove lines 470-478, don’t forget to move input_ids, input_mask and preds to CPU by calling .cpu() and assign all_preds = preds). A sketch of that kind of single-GPU replacement follows below.
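As a rough sketch of the single-GPU replacement for those sections (the function wrapper is hypothetical; the variable names loss, preds, input_ids and input_mask follow the comment above):

import torch

def collect_results(loss, preds, input_ids, input_mask):
    # Distributed version (now removed): dist.all_gather collected these
    # tensors from every GPU before averaging / concatenating them.
    # Single-GPU version: there is nothing to gather, so just move the local
    # tensors to the CPU and use them directly.
    all_preds = preds.cpu()
    all_input_ids = input_ids.cpu()
    all_input_mask = input_mask.cpu()
    return loss, all_preds, all_input_ids, all_input_mask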

Let me know if you still encounter other issues.

Thanks, Yu

Read more comments on GitHub >

Top Results From Across the Web

'RuntimeError: No rendezvous handler for env://' with multi-gpu
Bug I get an error 'RuntimeError: No rendezvous handler for env://' when I run my model with multiple GPU. Below the code and...

RuntimeError: No rendezvous handler for env:// on Windows
I am on windows 10 How can I solve the: RuntimeError: No rendezvous handler for env:// problem?

PyTorch's `dist_url` init method in Distributed Processing
On Windows, it gives the same error (RuntimeError: No rendezvous handler for tcp://) when init_process_group() is called. import torch.distributed as dist ...

Ddp on 2 GPUs: No rendezvous handler for env
I am testing a model with lightning, it has been working fine with 1 GPU. After added 2nd GPU today however, the following...

distributed/rendezvous.py · neilisaac/torch - Gemfury
Args: scheme (str): URL scheme to identify your rendezvous handler. handler ... not in _rendezvous_handlers: raise RuntimeError("No rendezvous handler for ...
