
RuntimeError: No rendezvous handler for tcp://


Hi @yumeng5 😊

Exciting work. 👍 I'm trying to run the agnews.sh script you provided, and I get the error below. I only have one GPU. Is that what causes this error?

Environment info

transformers version: 3.4.0
Platform: Windows 10
Python version: 3.6.9
PyTorch version (GPU?): 1.7
GPU: 1080ti * 1
Using distributed or parallel set-up in script?: yes

agnews.sh

"""
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0

DATASET=agnews
LABEL_NAME_FILE=label_names.txt
TRAIN_CORPUS=train.txt
TEST_CORPUS=test.txt
TEST_LABEL=test_labels.txt
MAX_LEN=200
TRAIN_BATCH=32
ACCUM_STEP=4
EVAL_BATCH=128
GPUS=1
MCP_EPOCH=3
SELF_TRAIN_EPOCH=1
…
"""

Problem

"""
Administrator@it-202007061711 MINGW64 /e/PycharmProjects/CCF/reference/LOTClass-master
$ sh agnews.sh
Namespace(accum_steps=4, category_vocab_size=100, dataset_dir='datasets/agnews/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=1, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['politics'], 1: ['sports'], 2: ['business'], 3: ['technology']}
Some weights of the model checkpoint at bert-base-uncased were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
This IS expected if you are initializing LOTClassModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing LOTClassModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LOTClassModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias', 'dense.weight', 'dense.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading encoded texts from datasets/agnews/train.pt
Loading texts with label names from datasets/agnews/label_name_data.pt
Loading encoded texts from datasets/agnews/test.pt
Contructing category vocabulary.
Traceback (most recent call last):
  File "src/train.py", line 69, in <module>
    main()
  File "src/train.py", line 56, in main
    trainer.category_vocabulary(top_pred_num=args.top_pred_num, category_vocab_size=args.category_vocab_size)
  File "E:\PycharmProjects\CCF\reference\LOTClass-master\src\trainer.py", line 296, in category_vocabulary
    mp.spawn(self.category_vocabulary_dist, nprocs=self.world_size, args=(top_pred_num, loader_name))
  File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py", line 157, in start_processes
    while not context.join():
  File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py", line 19, in _wrap
    fn(i, *args)
  File "E:\PycharmProjects\CCF\reference\LOTClass-master\src\trainer.py", line 260, in category_vocabulary_dist
    model = self.set_up_dist(rank)
  File "E:\PycharmProjects\CCF\reference\LOTClass-master\src\trainer.py", line 67, in set_up_dist
    rank=rank
  File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\distributed\distributed_c10d.py", line 421, in init_process_group
    init_method, rank, world_size, timeout=timeout
  File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
    raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for tcp://
"""

Looking forward to your reply, thank you.

Issue Analytics

State:
Created 3 years ago
Comments:6 (3 by maintainers)

Top GitHub Comments

4 reactions

yumeng5 commented, Nov 30, 2020

Hi @JeremySun1224,

You can simply remove the .module part, i.e. use torch.save(model.state_dict(), loader_file). The .module attribute is only used when the model is a parallel-trained one.

Best, Yu
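For readers applying the suggestion above, here is a minimal sketch of a save path that works whether or not the model is wrapped for parallel training. The helper name unwrap_model, the toy nn.Linear models, and the in-memory buffer are illustrative assumptions and are not part of LOTClass; in the real script the target would be loader_file.

```python
import io

import torch
from torch import nn


def unwrap_model(model: nn.Module) -> nn.Module:
    """Return the underlying model if it is wrapped for parallel training.

    Hypothetical helper, not part of LOTClass.
    """
    wrappers = (nn.DataParallel, nn.parallel.DistributedDataParallel)
    return model.module if isinstance(model, wrappers) else model


# Toy stand-ins for the real model: one plain, one wrapped.
plain = nn.Linear(4, 2)
wrapped = nn.DataParallel(nn.Linear(4, 2))

# Save the wrapped model's weights without the "module." key prefix.
buffer = io.BytesIO()  # stands in for loader_file
torch.save(unwrap_model(wrapped).state_dict(), buffer)

# The saved weights load directly into an unwrapped model.
buffer.seek(0)
state = torch.load(buffer, map_location="cpu")
plain.load_state_dict(state)
```

Saving the unwrapped state dict keeps the parameter names identical in both setups, so a checkpoint written on a multi-GPU run can be loaded on a single GPU and vice versa.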

2 reactions

yumeng5 commented, Nov 28, 2020

Hi @JeremySun1224,

I forgot to mention that you need to change this part too. dist.all_gather is called whenever we want to gather tensors across multiple GPUs (e.g. for computing loss and test accuracy). If you are not using distributed training, this part should be removed. Specifically, lines 434-436 should be deleted because we don't need to gather the training loss tensor. Similarly, the other dist.all_gather calls should be removed as well. Of course, you will want to take care of the devices and names of the variables when removing those lines (e.g. when you remove lines 470-478, don't forget to move input_ids, input_mask and preds to the CPU by calling .cpu(), and assign all_preds = preds).

Let me know if you still encounter other issues.

Thanks, Yu
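As an alternative to deleting each dist.all_gather call by hand, the same effect can be sketched with a small guard that only gathers when a process group has actually been initialized. The helper name gather_across_ranks and the toy preds tensor are assumptions for illustration; the actual variable handling in trainer.py may differ.

```python
import torch
import torch.distributed as dist


def gather_across_ranks(tensor: torch.Tensor) -> list:
    """Return one tensor per process, or just [tensor] when not distributed.

    Hypothetical helper, not part of LOTClass.
    """
    if dist.is_available() and dist.is_initialized():
        gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, tensor)
        return gathered
    # Single-process run: nothing to gather.
    return [tensor]


# Toy predictions; in a single-GPU run this reduces to all_preds = preds.cpu().
preds = torch.tensor([1, 0, 3])
all_preds = torch.cat(gather_across_ranks(preds), dim=0).cpu()
```

With this guard the calling code stays the same in both modes, at the cost of one extra function call compared with removing the gather lines outright.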
