
EOFError at the end of multiprocessing

See original GitHub issue

Describe the bug
Receiving an EOFError when multiprocessing, at the very end of training.

Minimal runnable code to reproduce the behavior
Launching a fairseq training with a simple transformer model on multiple GPUs; a rough reconstruction of the command is shown below. (I am aware this is not minimal at all; I hope it is enough for you to understand the issue.)
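
For reference, the launch command looked roughly like this (reconstructed from the Namespace dump in the log below, so exact flags may differ; transformer_test is my own small architecture):

fairseq-train data/data-bin/dummy.tokenized \
    --task translation --arch transformer_test \
    --optimizer adam --lr 0.0001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion cross_entropy --max-tokens 1000 --max-epoch 2 \
    --share-all-embeddings --seed 0 --log-format json \
    --distributed-world-size 2 \
    --save-dir checkpoints/dummy --tensorboard-logdir checkpoints/dummy/logs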

Expected behavior
Complete training without errors.

Environment

protobuf      3.12.2
torch         1.5.1
torchvision   0.6.0a0+35d732a

Python environment

conda create --name fairenv python=3.8
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
cd anaconda3/envs/fairenv/lib/python3.8/site-packages/
conda activate fairenv
git clone https://github.com/pytorch/fairseq.git
cd fairseq
pip install --editable .
conda install -c conda-forge tensorboardx 

Log

2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | distributed init (rank 0): tcp://localhost:18821
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | distributed init (rank 1): tcp://localhost:18821
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | initialized host decore1 as rank 1
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | initialized host decore1 as rank 0
2020-07-21 12:02:13 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.999)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer_test', attention_dropout=0.0, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='cross_entropy', cross_self_attention=False, curriculum=0, data='data/data-bin/dummy.tokenized', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=2, decoder_embed_dim=100, decoder_embed_path=None, decoder_ffn_embed_dim=100, decoder_input_dim=100, decoder_layerdrop=0, decoder_layers=2, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=100, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:18821', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=2, distributed_wrapper='DDP', dropout=0.1, empty_cache_freq=0, encoder_attention_heads=2, encoder_embed_dim=100, encoder_embed_path=None, encoder_ffn_embed_dim=100, encoder_layerdrop=0, encoder_layers=2, encoder_layers_to_keep=None, encoder_learned_pos=False, encoder_normalize_before=False, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, left_pad_source='True', left_pad_target='False', load_alignments=False, localsgd_frequency=3, log_format='json', log_interval=100, lr=[0.0001], lr_scheduler='inverse_sqrt', max_epoch=2, max_sentences=None, max_sentences_valid=None, max_source_positions=1000, max_target_positions=1000, max_tokens=1000, max_tokens_valid=1000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, model_parallel_size=1, no_cross_attention=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=2, num_batch_buckets=0, num_workers=1, optimizer='adam', optimizer_overrides='{}', patience=-1, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/dummy', save_interval=1, save_interval_updates=0, seed=0, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, source_lang=None, stop_time_hours=0, target_lang=None, task='translation', tensorboard_logdir='checkpoints/dummy/logs', threshold_loss_scale=None, tie_adaptive_weights=False, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, update_freq=[1], upsample_primary=1, 
use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
2020-07-21 12:02:13 | INFO | fairseq.tasks.translation | [fr] dictionary: 4632 types
2020-07-21 12:02:13 | INFO | fairseq.tasks.translation | [en] dictionary: 4632 types
2020-07-21 12:02:13 | INFO | fairseq.data.data_utils | loaded 887 examples from: data/data-bin/dummy.tokenized/valid.fr-en.fr
2020-07-21 12:02:13 | INFO | fairseq.data.data_utils | loaded 887 examples from: data/data-bin/dummy.tokenized/valid.fr-en.en
2020-07-21 12:02:13 | INFO | fairseq.tasks.translation | data/data-bin/dummy.tokenized valid fr-en 887 examples
2020-07-21 12:02:14 | INFO | fairseq_cli.train | TransformerModel(
  (encoder): TransformerEncoder(
    (dropout_module): FairseqDropout()
    (embed_tokens): Embedding(4632, 100, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=100, out_features=100, bias=True)
          (v_proj): Linear(in_features=100, out_features=100, bias=True)
          (q_proj): Linear(in_features=100, out_features=100, bias=True)
          (out_proj): Linear(in_features=100, out_features=100, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=100, out_features=100, bias=True)
        (fc2): Linear(in_features=100, out_features=100, bias=True)
        (final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=100, out_features=100, bias=True)
          (v_proj): Linear(in_features=100, out_features=100, bias=True)
          (q_proj): Linear(in_features=100, out_features=100, bias=True)
          (out_proj): Linear(in_features=100, out_features=100, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=100, out_features=100, bias=True)
        (fc2): Linear(in_features=100, out_features=100, bias=True)
        (final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (decoder): TransformerDecoder(
    (dropout_module): FairseqDropout()
    (embed_tokens): Embedding(4632, 100, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=100, out_features=100, bias=True)
          (v_proj): Linear(in_features=100, out_features=100, bias=True)
          (q_proj): Linear(in_features=100, out_features=100, bias=True)
          (out_proj): Linear(in_features=100, out_features=100, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=100, out_features=100, bias=True)
          (v_proj): Linear(in_features=100, out_features=100, bias=True)
          (q_proj): Linear(in_features=100, out_features=100, bias=True)
          (out_proj): Linear(in_features=100, out_features=100, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=100, out_features=100, bias=True)
        (fc2): Linear(in_features=100, out_features=100, bias=True)
        (final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=100, out_features=100, bias=True)
          (v_proj): Linear(in_features=100, out_features=100, bias=True)
          (q_proj): Linear(in_features=100, out_features=100, bias=True)
          (out_proj): Linear(in_features=100, out_features=100, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=100, out_features=100, bias=True)
          (v_proj): Linear(in_features=100, out_features=100, bias=True)
          (q_proj): Linear(in_features=100, out_features=100, bias=True)
          (out_proj): Linear(in_features=100, out_features=100, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=100, out_features=100, bias=True)
        (fc2): Linear(in_features=100, out_features=100, bias=True)
        (final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
      )
    )
    (output_projection): Linear(in_features=100, out_features=4632, bias=False)
  )
)
2020-07-21 12:02:14 | INFO | fairseq_cli.train | model transformer_test, criterion CrossEntropyCriterion
2020-07-21 12:02:14 | INFO | fairseq_cli.train | num. model params: 788400 (num. trained: 788400)
2020-07-21 12:02:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2020-07-21 12:02:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2020-07-21 12:02:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2020-07-21 12:02:14 | INFO | fairseq.utils | rank   0: capabilities =  6.1  ; total memory = 11.910 GB ; name = TITAN Xp                                
2020-07-21 12:02:14 | INFO | fairseq.utils | rank   1: capabilities =  6.1  ; total memory = 11.910 GB ; name = TITAN Xp                                
2020-07-21 12:02:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2020-07-21 12:02:14 | INFO | fairseq_cli.train | training on 2 devices (GPUs/TPUs)
2020-07-21 12:02:14 | INFO | fairseq_cli.train | max tokens per GPU = 1000 and max sentences per GPU = None
2020-07-21 12:02:14 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/dummy/checkpoint_last.pt
2020-07-21 12:02:14 | INFO | fairseq.trainer | loading train data for epoch 1
2020-07-21 12:02:14 | INFO | fairseq.data.data_utils | loaded 954 examples from: data/data-bin/dummy.tokenized/train.fr-en.fr
2020-07-21 12:02:14 | INFO | fairseq.data.data_utils | loaded 954 examples from: data/data-bin/dummy.tokenized/train.fr-en.en
2020-07-21 12:02:14 | INFO | fairseq.tasks.translation | data/data-bin/dummy.tokenized train fr-en 954 examples
2020-07-21 12:02:14 | INFO | fairseq_cli.train | begin training epoch 1
/data1/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/fairseq/fairseq/utils.py:303: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
  warnings.warn(
/data1/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/fairseq/fairseq/utils.py:303: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
  warnings.warn(
2020-07-21 12:02:16 | INFO | fairseq_cli.train | begin validation on "valid" subset
2020-07-21 12:02:18 | INFO | valid | {"epoch": 1, "valid_loss": "12.856", "valid_ppl": "7414.09", "valid_wps": "75523.5", "valid_wpb": "1120.8", "valid_bsz": "35.5", "valid_num_updates": "16"}
2020-07-21 12:02:18 | INFO | fairseq_cli.train | begin save checkpoint
2020-07-21 12:02:18 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/dummy/checkpoint1.pt (epoch 1 @ 16 updates, score 12.856) (writing took 0.673235297203064 seconds)
2020-07-21 12:02:18 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2020-07-21 12:02:18 | INFO | train | {"epoch": 1, "train_loss": "12.934", "train_ppl": "7823.58", "train_wps": "5812.3", "train_ups": "4.72", "train_wpb": "1246.1", "train_bsz": "59.6", "train_num_updates": "16", "train_lr": "4.996e-07", "train_gnorm": "1.985", "train_train_wall": "1", "train_wall": "5"}
2020-07-21 12:02:18 | INFO | fairseq_cli.train | begin training epoch 1
2020-07-21 12:02:20 | INFO | fairseq_cli.train | begin validation on "valid" subset
2020-07-21 12:02:22 | INFO | valid | {"epoch": 2, "valid_loss": "12.85", "valid_ppl": "7381.96", "valid_wps": "59457.9", "valid_wpb": "1120.8", "valid_bsz": "35.5", "valid_num_updates": "32", "valid_best_loss": "12.85"}
2020-07-21 12:02:22 | INFO | fairseq_cli.train | begin save checkpoint
2020-07-21 12:02:23 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/dummy/checkpoint2.pt (epoch 2 @ 32 updates, score 12.85) (writing took 0.6933992877602577 seconds)
2020-07-21 12:02:23 | INFO | fairseq_cli.train | end of epoch 2 (average epoch stats below)
2020-07-21 12:02:23 | INFO | train | {"epoch": 2, "train_loss": "12.931", "train_ppl": "7807.27", "train_wps": "4502", "train_ups": "3.61", "train_wpb": "1246.1", "train_bsz": "59.6", "train_num_updates": "32", "train_lr": "8.992e-07", "train_gnorm": "1.992", "train_train_wall": "1", "train_wall": "9"}
2020-07-21 12:02:23 | INFO | fairseq_cli.train | done training in 9.2 seconds
Exception in thread Thread-3:
Exception in thread Thread-4:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/threading.py", line 932, in _bootstrap_inner
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 202, in run
    self.run()
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 202, in run
    data = self._queue.get(True, queue_wait_duration)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/queues.py", line 111, in get
    data = self._queue.get(True, queue_wait_duration)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/queues.py", line 111, in get
    res = self._recv_bytes()
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    res = self._recv_bytes()
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    buf = self._recv(4)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
    raise EOFError
EOFError
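
What the traceback shows: two tensorboardX EventFileWriter background threads (Thread-3 and Thread-4, one per training process) are blocked in _queue.get, and at shutdown the write end of the queue's underlying pipe is closed, so Connection._recv hits end-of-file and raises EOFError. A minimal sketch of that failure mode, independent of fairseq and tensorboardX:

import multiprocessing as mp

def worker(conn):
    # Exit without sending anything; this closes the child's end of the pipe.
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=worker, args=(child_conn,))
    p.start()
    p.join()
    child_conn.close()  # drop the parent's copy of the child end too
    try:
        parent_conn.recv()  # same code path as _recv_bytes in the traceback
    except EOFError:
        print("EOFError: nothing left to receive and the other end closed")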

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 3
  • Comments: 8 (1 by maintainers)

Top GitHub Comments

2 reactions
LudoHackathon commented, Jan 11, 2021

Didn’t you forget to close a writer, as suggested here? If not, maybe you can try adding a suffix specific to each writer when creating it, something like: writer = SummaryWriter(log_dir, filename_suffix=f'_{run_id}')
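
A minimal sketch of both suggestions combined, assuming a tensorboardX SummaryWriter; run_id is a hypothetical per-process identifier (e.g. the distributed rank), not a fairseq variable:

from tensorboardX import SummaryWriter

run_id = 0  # hypothetical: the rank of this training process
log_dir = "checkpoints/dummy/logs"

# One writer per process, with a unique filename suffix so the two
# ranks do not race on the same event file.
writer = SummaryWriter(log_dir, filename_suffix=f"_{run_id}")
try:
    writer.add_scalar("train/loss", 12.934, global_step=16)
finally:
    # close() flushes pending events and stops the background
    # EventFileWriter thread before process teardown.
    writer.close()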

1 reaction
jiminsun commented, May 13, 2021

I downgraded the tensorboardx package to 2.1 and it worked in a torch 1.7.1 and CUDA 11 environment.
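
For anyone wanting to try the same workaround, the pin would be:

pip install tensorboardx==2.1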

Read more comments on GitHub >

Top Results From Across the Web

python multiprocessing queue error - Stack Overflow
The results of the processing are put on a queue, which is watched by another process. The code runs, but after completion I...
Multiprocesing socket - Python Forum
Raises EOFError if there is nothing left to receive and the other end was closed. So the EOFError is expected when the data...
17.2. multiprocessing — Process-based parallelism
Raises EOFError if there is nothing left to receive and the other end has closed. If maxlength is specified and the message is...
Steps to Avoid EOFError in Python with Examples - eduCBA
Explanation: In the above program, try and except blocks are used to avoid the EOFError exception by using an empty string that will...
multiprocessing.Connection does not communicate pipe ...
My expectation was that when calling recv() on the remote end, it should raise EOFError if the pipe has been closed.
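
Tying the snippets above together: a reader that treats EOFError as "the producer went away" instead of letting it escape a daemon thread might look like the following sketch (illustrative only, not tensorboardX's actual code):

import multiprocessing as mp
import queue

def drain(q, timeout=0.5):
    # Consume items until a None sentinel arrives or the feeder disappears.
    while True:
        try:
            item = q.get(True, timeout)
        except queue.Empty:
            continue  # nothing yet; keep polling
        except (EOFError, OSError):
            break  # the queue's underlying pipe was closed
        if item is None:
            break  # explicit shutdown sentinel
        print("got", item)

if __name__ == "__main__":
    q = mp.Queue()
    q.put("event")
    q.put(None)  # tell the reader we are done
    drain(q)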
