
Connection closed by peer error after ~4000 iterations

See original GitHub issue

Instructions To Reproduce the Issue:

  1. what changes you made (git diff) or what code you wrote:

In train_net.py I use register_coco_instances() to register my train and validation datasets (see the sketch after the log excerpt below).

  2. what exact command you run:

I run

tools/train_net.py --num-gpus 4 --config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml DATASETS.TRAIN ('mydataset_train',) DATASETS.TEST ('mydataset_val',) SOLVER.IMS_PER_BATCH 8 SOLVER.BASE_LR 0.01 OUTPUT_DIR /home/$USER/shared

within a Docker container.

  3. what you observed (including the full logs):

After ~3750 iterations, I keep getting the following error. If I train for fewer than ~3750 iterations, everything finishes successfully.
[02/05 23:05:59 d2.utils.events]: eta: 9:07:04  iter: 3659  total_loss: 1.117  loss_cls: 0.345  loss_box_reg: 0.285  loss_mask: 0.343  loss_rpn_cls: 0.077  loss_rpn_loc: 0.105  time: 0.3771  data_time: 0.0084  lr: 0.010000  max_mem: 3402M
[02/05 23:06:07 d2.utils.events]: eta: 9:07:43  iter: 3679  total_loss: 1.123  loss_cls: 0.355  loss_box_reg: 0.285  loss_mask: 0.353  loss_rpn_cls: 0.065  loss_rpn_loc: 0.092  time: 0.3772  data_time: 0.0082  lr: 0.010000  max_mem: 3402M
[02/05 23:06:14 d2.utils.events]: eta: 9:08:00  iter: 3699  total_loss: 1.137  loss_cls: 0.329  loss_box_reg: 0.300  loss_mask: 0.348  loss_rpn_cls: 0.068  loss_rpn_loc: 0.081  time: 0.3773  data_time: 0.0083  lr: 0.010000  max_mem: 3402M
[02/05 23:06:22 d2.utils.events]: eta: 9:07:50  iter: 3719  total_loss: 1.175  loss_cls: 0.351  loss_box_reg: 0.297  loss_mask: 0.360  loss_rpn_cls: 0.066  loss_rpn_loc: 0.096  time: 0.3773  data_time: 0.0089  lr: 0.010000  max_mem: 3402M
[02/05 23:06:30 d2.utils.events]: eta: 9:07:50  iter: 3739  total_loss: 1.186  loss_cls: 0.334  loss_box_reg: 0.296  loss_mask: 0.353  loss_rpn_cls: 0.079  loss_rpn_loc: 0.111  time: 0.3774  data_time: 0.0079  lr: 0.010000  max_mem: 3402M
[02/05 23:06:38 d2.utils.events]: eta: 9:08:46  iter: 3759  total_loss: 1.160  loss_cls: 0.359  loss_box_reg: 0.283  loss_mask: 0.355  loss_rpn_cls: 0.070  loss_rpn_loc: 0.081  time: 0.3774  data_time: 0.0084  lr: 0.010000  max_mem: 3402M
[02/05 23:06:45 d2.utils.events]: eta: 9:08:46  iter: 3779  total_loss: 1.285  loss_cls: 0.364  loss_box_reg: 0.303  loss_mask: 0.376  loss_rpn_cls: 0.070  loss_rpn_loc: 0.092  time: 0.3775  data_time: 0.0083  lr: 0.010000  max_mem: 3402M
[02/05 23:06:53 d2.utils.events]: eta: 9:08:20  iter: 3799  total_loss: 1.227  loss_cls: 0.340  loss_box_reg: 0.300  loss_mask: 0.347  loss_rpn_cls: 0.084  loss_rpn_loc: 0.115  time: 0.3775  data_time: 0.0085  lr: 0.010000  max_mem: 3402M
ERROR [02/05 23:07:15 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 220, in run_step
    self._write_metrics(metrics_dict)
  File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 255, in _write_metrics
    all_metrics_dict = comm.gather(metrics_dict)
  File "/podc/src/detectron2/detectron2/utils/comm.py", line 200, in gather
    size_list, tensor = _pad_to_largest_tensor(tensor, group)
  File "/podc/src/detectron2/detectron2/utils/comm.py", line 126, in _pad_to_largest_tensor
    dist.all_gather(size_list, local_size, group=group)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1158, in all_gather
    work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:572] Connection closed by peer [172.19.0.2]:13729
[02/05 23:07:15 d2.engine.hooks]: Overall training speed: 3803 iterations in 0:24:16 (0.3829 s / it)
[02/05 23:07:15 d2.engine.hooks]: Total training time: 0:24:19 (0:00:02 on hooks)
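
For context on step 1 above, registering custom COCO-format splits with register_coco_instances() typically looks like the minimal sketch below. The dataset names match the command line above, but the annotation and image paths are placeholders rather than the reporter's actual layout.

from detectron2.data.datasets import register_coco_instances

# Register the train and validation splits by name; these names are what
# DATASETS.TRAIN and DATASETS.TEST refer to on the command line.
# All paths below are illustrative placeholders.
register_coco_instances(
    "mydataset_train",
    {},                                         # extra metadata (may be empty)
    "/data/mydataset/annotations/train.json",   # COCO-format annotation file
    "/data/mydataset/images/train",             # image root directory
)
register_coco_instances(
    "mydataset_val",
    {},
    "/data/mydataset/annotations/val.json",
    "/data/mydataset/images/val",
)

As the comments below suggest, the "Connection closed by peer" in the traceback comes from a gloo all_gather inside comm.gather: once one worker process dies (here, apparently while reading a corrupt image), the surviving ranks fail at the next collective with this error rather than with the original exception.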

Expected behavior

Environment:

------------------------  ---------------------------------------------------------
sys.platform              linux
Python                    3.6.9 (default, Nov  7 2019, 10:44:02) [GCC 8.3.0]
numpy                     1.18.1
detectron2                0.1 @/podc/src/detectron2/detectron2
detectron2 compiler       GCC 7.4
detectron2 CUDA compiler  10.1
detectron2 arch flags     sm_60
DETECTRON2_ENV_MODULE     <not set>
PyTorch                   1.4.0 @/usr/local/lib/python3.6/dist-packages/torch
PyTorch debug build       False
CUDA available            True
GPU 0,1,2,3               Tesla P100-SXM2-16GB
CUDA_HOME                 /usr/local/cuda
NVCC                      Cuda compilation tools, release 10.1, V10.1.243
Pillow                    6.2.2
torchvision               0.5.0 @/usr/local/lib/python3.6/dist-packages/torchvision
torchvision arch flags    sm_35, sm_50, sm_60, sm_70, sm_75
------------------------  ---------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 10

Top GitHub Comments

2 reactions
roger1993 commented, Feb 9, 2020

@roger1993 have you encountered this kind of warning?

[02/09 08:25:25 d2.utils.events]: eta: 17 days, 5:56:44 iter: 99 total_loss: 0.334 loss_cls: 0.024 loss_box_reg: 0.030 loss_mask: 0.109 loss_mask_point: 0.152 loss_rpn_cls: 0.002 loss_rpn_loc: 0.018 time: 5.8734 data_time: 3.7765 lr: 0.001998 max_mem: 11965M
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:784: UserWarning: Possibly corrupt EXIF data. Expecting to read 44 bytes but only got 40. Skipping tag 37510
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:784: UserWarning: Possibly corrupt EXIF data. Expecting to read 8 bytes but only got 0. Skipping tag 41730

Because my dataset comes from industry and is very dirty, using read_image in detectron2 caused training to fail, so I changed the read_image function from Pillow to OpenCV at https://github.com/facebookresearch/detectron2/blob/master/detectron2/data/detection_utils.py#L36. You can roughly refer to this issue: https://github.com/facebookresearch/detectron2/issues/788. I don't have the same problem as yours.
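
For readers who want to try the same workaround, here is a minimal sketch of an OpenCV-based reader that could stand in for detectron2's Pillow-based read_image. The function name, error handling, and format handling are assumptions for illustration, not the exact change made in detectron2 or in issue #788.

import cv2

def read_image_cv2(file_name, format="BGR"):
    # cv2.imread tolerates malformed EXIF data and returns a BGR ndarray,
    # or None if the file cannot be decoded at all.
    img = cv2.imread(file_name, cv2.IMREAD_COLOR)
    if img is None:
        # Fail loudly so corrupt files are easy to locate, instead of the
        # failure surfacing later as a distributed-training error.
        raise OSError("Failed to read image: {}".format(file_name))
    if format == "RGB":
        # detectron2 defaults to BGR input; convert only if RGB is requested.
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return img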

1 reaction
CMobley7 commented, Feb 10, 2020

So, running it with 1 GPU eventually failed as well, because there were a few corrupt JPEGs in the dataset I was using. I re-downloaded the dataset and ran with both 1 GPU and 4 GPUs on 3 different boxes, and didn't receive the error again. Also note that I rebuilt my Docker images before running this second test; when installing torchvision 0.5.0, Pillow 7.0.0 was installed instead of 6.2.2. I think the ultimate cause was problems with certain images in my dataset. So, like @roger1993, I'd suggest either cleaning your data or updating the read_image function further so that any image that raises an error is skipped. However, it would be nice to see the default read_image function updated so that problematic images are skipped instead of producing an arbitrary "Connection closed by peer" error. Should I close this issue since my problem is solved, or leave it open so that potential work on the read_image function can be tied to an issue?
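
A quick way to find such files before training (my own suggestion, not something from the original thread) is to scan the image directory once and log every file that fails a full decode:

import os
from PIL import Image

def find_corrupt_images(image_root):
    # Walk the dataset directory and attempt a full decode of every file,
    # collecting anything that raises. The path used below is a placeholder.
    bad = []
    for root, _, files in os.walk(image_root):
        for name in files:
            path = os.path.join(root, name)
            try:
                with Image.open(path) as img:
                    img.load()  # force a full decode, not just the header
            except Exception as exc:
                bad.append((path, exc))
    return bad

for path, exc in find_corrupt_images("/data/mydataset/images/train"):
    print("corrupt:", path, exc)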

Read more comments on GitHub >

