EOFError from multiprocessing at the end of training
Describe the bug
Receiving an EOFError from multiprocessing at the very end of training.
Minimal runnable code to reproduce the behavior
Launching a fairseq training with a simple transformer model on multiple GPUs. (I am aware this is not minimal at all; I hope this is enough for you to understand the issue.)
Expected behavior
Complete training without errors.
Environment
protobuf 3.12.2
torch 1.5.1
torchvision 0.6.0a0+35d732a
Python environment
conda create --name fairenv python=3.8
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
cd anaconda3/envs/fairenv/lib/python3.8/site-packages/
conda activate fairenv
git clone https://github.com/pytorch/fairseq.git
cd fairseq
pip install --editable .
conda install -c conda-forge tensorboardx
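As a quick sanity check of the setup above, the installed versions can be confirmed from Python (a minimal sketch; the expected values are the ones listed under Environment):

import tensorboardX
import torch
import torchvision

print(torch.__version__)         # expected: 1.5.1
print(torchvision.__version__)   # expected: 0.6.0a0+35d732a
print(tensorboardX.__version__)
print(torch.cuda.is_available(), torch.cuda.device_count())  # 2 GPUs in this setup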
Log
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | distributed init (rank 0): tcp://localhost:18821
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | distributed init (rank 1): tcp://localhost:18821
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | initialized host decore1 as rank 1
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | initialized host decore1 as rank 0
2020-07-21 12:02:13 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.999)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer_test', attention_dropout=0.0, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='cross_entropy', cross_self_attention=False, curriculum=0, data='data/data-bin/dummy.tokenized', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=2, decoder_embed_dim=100, decoder_embed_path=None, decoder_ffn_embed_dim=100, decoder_input_dim=100, decoder_layerdrop=0, decoder_layers=2, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=100, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:18821', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=2, distributed_wrapper='DDP', dropout=0.1, empty_cache_freq=0, encoder_attention_heads=2, encoder_embed_dim=100, encoder_embed_path=None, encoder_ffn_embed_dim=100, encoder_layerdrop=0, encoder_layers=2, encoder_layers_to_keep=None, encoder_learned_pos=False, encoder_normalize_before=False, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, left_pad_source='True', left_pad_target='False', load_alignments=False, localsgd_frequency=3, log_format='json', log_interval=100, lr=[0.0001], lr_scheduler='inverse_sqrt', max_epoch=2, max_sentences=None, max_sentences_valid=None, max_source_positions=1000, max_target_positions=1000, max_tokens=1000, max_tokens_valid=1000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, model_parallel_size=1, no_cross_attention=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=2, num_batch_buckets=0, num_workers=1, optimizer='adam', optimizer_overrides='{}', patience=-1, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/dummy', save_interval=1, save_interval_updates=0, seed=0, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, source_lang=None, stop_time_hours=0, target_lang=None, task='translation', tensorboard_logdir='checkpoints/dummy/logs', threshold_loss_scale=None, tie_adaptive_weights=False, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, update_freq=[1], upsample_primary=1, 
use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
2020-07-21 12:02:13 | INFO | fairseq.tasks.translation | [fr] dictionary: 4632 types
2020-07-21 12:02:13 | INFO | fairseq.tasks.translation | [en] dictionary: 4632 types
2020-07-21 12:02:13 | INFO | fairseq.data.data_utils | loaded 887 examples from: data/data-bin/dummy.tokenized/valid.fr-en.fr
2020-07-21 12:02:13 | INFO | fairseq.data.data_utils | loaded 887 examples from: data/data-bin/dummy.tokenized/valid.fr-en.en
2020-07-21 12:02:13 | INFO | fairseq.tasks.translation | data/data-bin/dummy.tokenized valid fr-en 887 examples
2020-07-21 12:02:14 | INFO | fairseq_cli.train | TransformerModel(
(encoder): TransformerEncoder(
(dropout_module): FairseqDropout()
(embed_tokens): Embedding(4632, 100, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=100, out_features=100, bias=True)
(v_proj): Linear(in_features=100, out_features=100, bias=True)
(q_proj): Linear(in_features=100, out_features=100, bias=True)
(out_proj): Linear(in_features=100, out_features=100, bias=True)
)
(self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=100, out_features=100, bias=True)
(fc2): Linear(in_features=100, out_features=100, bias=True)
(final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=100, out_features=100, bias=True)
(v_proj): Linear(in_features=100, out_features=100, bias=True)
(q_proj): Linear(in_features=100, out_features=100, bias=True)
(out_proj): Linear(in_features=100, out_features=100, bias=True)
)
(self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=100, out_features=100, bias=True)
(fc2): Linear(in_features=100, out_features=100, bias=True)
(final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): TransformerDecoder(
(dropout_module): FairseqDropout()
(embed_tokens): Embedding(4632, 100, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=100, out_features=100, bias=True)
(v_proj): Linear(in_features=100, out_features=100, bias=True)
(q_proj): Linear(in_features=100, out_features=100, bias=True)
(out_proj): Linear(in_features=100, out_features=100, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=100, out_features=100, bias=True)
(v_proj): Linear(in_features=100, out_features=100, bias=True)
(q_proj): Linear(in_features=100, out_features=100, bias=True)
(out_proj): Linear(in_features=100, out_features=100, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=100, out_features=100, bias=True)
(fc2): Linear(in_features=100, out_features=100, bias=True)
(final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=100, out_features=100, bias=True)
(v_proj): Linear(in_features=100, out_features=100, bias=True)
(q_proj): Linear(in_features=100, out_features=100, bias=True)
(out_proj): Linear(in_features=100, out_features=100, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=100, out_features=100, bias=True)
(v_proj): Linear(in_features=100, out_features=100, bias=True)
(q_proj): Linear(in_features=100, out_features=100, bias=True)
(out_proj): Linear(in_features=100, out_features=100, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=100, out_features=100, bias=True)
(fc2): Linear(in_features=100, out_features=100, bias=True)
(final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
)
)
(output_projection): Linear(in_features=100, out_features=4632, bias=False)
)
)
2020-07-21 12:02:14 | INFO | fairseq_cli.train | model transformer_test, criterion CrossEntropyCriterion
2020-07-21 12:02:14 | INFO | fairseq_cli.train | num. model params: 788400 (num. trained: 788400)
2020-07-21 12:02:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2020-07-21 12:02:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2020-07-21 12:02:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2020-07-21 12:02:14 | INFO | fairseq.utils | rank 0: capabilities = 6.1 ; total memory = 11.910 GB ; name = TITAN Xp
2020-07-21 12:02:14 | INFO | fairseq.utils | rank 1: capabilities = 6.1 ; total memory = 11.910 GB ; name = TITAN Xp
2020-07-21 12:02:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2020-07-21 12:02:14 | INFO | fairseq_cli.train | training on 2 devices (GPUs/TPUs)
2020-07-21 12:02:14 | INFO | fairseq_cli.train | max tokens per GPU = 1000 and max sentences per GPU = None
2020-07-21 12:02:14 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/dummy/checkpoint_last.pt
2020-07-21 12:02:14 | INFO | fairseq.trainer | loading train data for epoch 1
2020-07-21 12:02:14 | INFO | fairseq.data.data_utils | loaded 954 examples from: data/data-bin/dummy.tokenized/train.fr-en.fr
2020-07-21 12:02:14 | INFO | fairseq.data.data_utils | loaded 954 examples from: data/data-bin/dummy.tokenized/train.fr-en.en
2020-07-21 12:02:14 | INFO | fairseq.tasks.translation | data/data-bin/dummy.tokenized train fr-en 954 examples
2020-07-21 12:02:14 | INFO | fairseq_cli.train | begin training epoch 1
/data1/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/fairseq/fairseq/utils.py:303: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
warnings.warn(
/data1/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/fairseq/fairseq/utils.py:303: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
warnings.warn(
2020-07-21 12:02:16 | INFO | fairseq_cli.train | begin validation on "valid" subset
2020-07-21 12:02:18 | INFO | valid | {"epoch": 1, "valid_loss": "12.856", "valid_ppl": "7414.09", "valid_wps": "75523.5", "valid_wpb": "1120.8", "valid_bsz": "35.5", "valid_num_updates": "16"}
2020-07-21 12:02:18 | INFO | fairseq_cli.train | begin save checkpoint
2020-07-21 12:02:18 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/dummy/checkpoint1.pt (epoch 1 @ 16 updates, score 12.856) (writing took 0.673235297203064 seconds)
2020-07-21 12:02:18 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2020-07-21 12:02:18 | INFO | train | {"epoch": 1, "train_loss": "12.934", "train_ppl": "7823.58", "train_wps": "5812.3", "train_ups": "4.72", "train_wpb": "1246.1", "train_bsz": "59.6", "train_num_updates": "16", "train_lr": "4.996e-07", "train_gnorm": "1.985", "train_train_wall": "1", "train_wall": "5"}
2020-07-21 12:02:18 | INFO | fairseq_cli.train | begin training epoch 1
2020-07-21 12:02:20 | INFO | fairseq_cli.train | begin validation on "valid" subset
2020-07-21 12:02:22 | INFO | valid | {"epoch": 2, "valid_loss": "12.85", "valid_ppl": "7381.96", "valid_wps": "59457.9", "valid_wpb": "1120.8", "valid_bsz": "35.5", "valid_num_updates": "32", "valid_best_loss": "12.85"}
2020-07-21 12:02:22 | INFO | fairseq_cli.train | begin save checkpoint
2020-07-21 12:02:23 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/dummy/checkpoint2.pt (epoch 2 @ 32 updates, score 12.85) (writing took 0.6933992877602577 seconds)
2020-07-21 12:02:23 | INFO | fairseq_cli.train | end of epoch 2 (average epoch stats below)
2020-07-21 12:02:23 | INFO | train | {"epoch": 2, "train_loss": "12.931", "train_ppl": "7807.27", "train_wps": "4502", "train_ups": "3.61", "train_wpb": "1246.1", "train_bsz": "59.6", "train_num_updates": "32", "train_lr": "8.992e-07", "train_gnorm": "1.992", "train_train_wall": "1", "train_wall": "9"}
2020-07-21 12:02:23 | INFO | fairseq_cli.train | done training in 9.2 seconds
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 202, in run
    data = self._queue.get(True, queue_wait_duration)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/queues.py", line 111, in get
    res = self._recv_bytes()
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Exception in thread Thread-4:
(identical traceback, printed interleaved with Thread-3's in the original output)
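The traceback points at tensorboardX's background writer thread: it blocks on a multiprocessing queue, and when the process tears down and closes the write end of the underlying pipe, the blocked read raises EOFError. A minimal, fairseq-independent sketch of that mechanism (the raw Pipe below stands in for tensorboardX's internal queue; this is an illustration, not the library's actual code):

import multiprocessing
import threading

def reader(conn):
    # Blocks, like the queue.get() call in the traceback above.
    try:
        conn.recv_bytes()
    except EOFError:
        print("EOFError: write end closed while the reader was blocked")

r, w = multiprocessing.Pipe(duplex=False)
t = threading.Thread(target=reader, args=(r,))
t.start()
w.close()  # closing the writer wakes the blocked reader with EOFError
t.join()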
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Didn’t you forget to close a writer, as suggested here? If not, maybe you can try adding a suffix specific to each writer when creating it, something like:
writer = SummaryWriter(log_dir, filename_suffix=f'_{run_id}')
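To make the suggestion concrete, a minimal sketch of closing the writer explicitly so its background thread shuts down cleanly before the process exits (log_dir matches --tensorboard-logdir from the log above; run_id and the logged values are illustrative):

from tensorboardX import SummaryWriter

log_dir = "checkpoints/dummy/logs"  # matches --tensorboard-logdir above
run_id = 0                          # e.g. the distributed rank (illustrative)
writer = SummaryWriter(log_dir, filename_suffix=f"_{run_id}")
try:
    writer.add_scalar("valid/loss", 12.85, 32)  # illustrative values from the log
finally:
    writer.close()  # flushes pending events and joins the background writer thread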
I downgraded the tensorboardX package to 2.1 and it worked in a torch 1.7.1 + CUDA 11 environment.