RuntimeError: No rendezvous handler for tcp://
Hi @yumeng5,
Exciting work. I'm trying to run the agnews.sh script you provided, and I get the error below. I only have one GPU. Could that be the cause of this error?
Environment info
- transformers version: 3.4.0
- Platform: Windows 10
- Python version: 3.6.9
- PyTorch version (GPU?): 1.7
- GPU: 1080ti * 1
- Using distributed or parallel set-up in script?: yes
agnews.sh
```
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0

DATASET=agnews
LABEL_NAME_FILE=label_names.txt
TRAIN_CORPUS=train.txt
TEST_CORPUS=test.txt
TEST_LABEL=test_labels.txt
MAX_LEN=200
TRAIN_BATCH=32
ACCUM_STEP=4
EVAL_BATCH=128
GPUS=1
MCP_EPOCH=3
SELF_TRAIN_EPOCH=1
…
```
Problem
```
Administrator@it-202007061711 MINGW64 /e/PycharmProjects/CCF/reference/LOTClass-master
$ sh agnews.sh
Namespace(accum_steps=4, category_vocab_size=100, dataset_dir='datasets/agnews/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=1, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['politics'], 1: ['sports'], 2: ['business'], 3: ['technology']}
Some weights of the model checkpoint at bert-base-uncased were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing LOTClassModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LOTClassModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LOTClassModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias', 'dense.weight', 'dense.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading encoded texts from datasets/agnews/train.pt
Loading texts with label names from datasets/agnews/label_name_data.pt
Loading encoded texts from datasets/agnews/test.pt
Contructing category vocabulary.
Traceback (most recent call last):
  File "src/train.py", line 69, in <module>
    main()
  File "src/train.py", line 56, in main
    trainer.category_vocabulary(top_pred_num=args.top_pred_num, category_vocab_size=args.category_vocab_size)
  File "E:\PycharmProjects\CCF\reference\LOTClass-master\src\trainer.py", line 296, in category_vocabulary
    mp.spawn(self.category_vocabulary_dist, nprocs=self.world_size, args=(top_pred_num, loader_name))
  File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py", line 157, in start_processes
    while not context.join():
  File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py", line 19, in _wrap
    fn(i, *args)
  File "E:\PycharmProjects\CCF\reference\LOTClass-master\src\trainer.py", line 260, in category_vocabulary_dist
    model = self.set_up_dist(rank)
  File "E:\PycharmProjects\CCF\reference\LOTClass-master\src\trainer.py", line 67, in set_up_dist
    rank=rank
  File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\distributed\distributed_c10d.py", line 421, in init_process_group
    init_method, rank, world_size, timeout=timeout
  File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
    raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for tcp://
```
Looking forward to your reply, thank you.
Top GitHub Comments
Hi @JeremySun1224,
You can simply remove the `.module` part (i.e. `torch.save(model.state_dict(), loader_file)`). `.module` is only used when the model is a parallel-trained one.
Best, Yu
Hi @JeremySun1224,
I forgot to mention that you need to change this part too. `dist.all_gather` is called whenever we want to gather tensors across multiple GPUs (e.g. for computing loss and test accuracy). If you are not using distributed training, then this part should be removed. Specifically, lines 434-436 should be deleted because we don't need to gather the training loss tensor. Similarly, the other `dist.all_gather` function calls should be removed as well. Of course, you will want to take care of the devices and names of the variables when removing those lines (e.g. when you remove lines 470-478, don't forget to move `input_ids`, `input_mask` and `preds` to CPU by calling `.cpu()` and assign `all_preds = preds`).
Let me know if you still encounter other issues.
Thanks, Yu
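The reply above suggests deleting the `dist.all_gather` calls outright. A different option, sketched below under stated assumptions, is to guard them so the same code runs with or without a process group. The function name `gather_if_distributed` and the sample `preds` tensor are hypothetical and do not come from the LOTClass source.

```python
import torch
import torch.distributed as dist


def gather_if_distributed(tensor: torch.Tensor) -> torch.Tensor:
    """Concatenate `tensor` across ranks, or return it unchanged in a single process."""
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        # One receive buffer per rank, as dist.all_gather expects.
        gathered = [torch.empty_like(tensor) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, tensor)
        return torch.cat(gathered, dim=0)
    # No process group was initialized: there is nothing to gather.
    return tensor


# Single-process usage: the predictions are used directly and moved to CPU,
# which mirrors the `all_preds = preds` assignment described in the reply.
preds = torch.tensor([1, 0, 3])
all_preds = gather_if_distributed(preds).cpu()
```

This keeps one code path for both setups, at the cost of an extra check on every call; simply removing the calls, as the reply describes, is the smaller change if distributed training is never needed.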