Connection closed by peer error after ~4000 iterations
Instructions To Reproduce the Issue:
- what changes you made (`git diff`) or what code you wrote: in `train_net.py` I use `register_coco_instances()` to register my train and validation datasets.
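Roughly, the registration looks like the minimal sketch below (the annotation and image paths are placeholders for my actual data):

```python
# Minimal sketch of the dataset registration in train_net.py.
# The JSON annotation files and image directories are placeholder paths.
from detectron2.data.datasets import register_coco_instances

register_coco_instances(
    "mydataset_train", {},
    "datasets/mydataset/annotations/train.json",
    "datasets/mydataset/images/train",
)
register_coco_instances(
    "mydataset_val", {},
    "datasets/mydataset/annotations/val.json",
    "datasets/mydataset/images/val",
)
```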
- what exact command you run: within a Docker container, I run

```
tools/train_net.py --num-gpus 4 --config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml DATASETS.TRAIN ('mydataset_train',) DATASETS.TEST ('mydataset_val',) SOLVER.IMS_PER_BATCH 8 SOLVER.BASE_LR 0.01 OUTPUT_DIR /home/$USER/shared
```
- what you observed (including the full logs): after ~3750 iterations I keep getting the following error. If I train for fewer than ~3750 iterations, everything finishes successfully.
[02/05 23:05:59 d2.utils.events]: eta: 9:07:04 iter: 3659 total_loss: 1.117 loss_cls: 0.345 loss_box_reg: 0.285 loss_mask: 0.343 loss_rpn_cls: 0.077 loss_rpn_loc: 0.105 time: 0.3771 data_time: 0.0084 lr: 0.010000 max_mem: 3402M
[02/05 23:06:07 d2.utils.events]: eta: 9:07:43 iter: 3679 total_loss: 1.123 loss_cls: 0.355 loss_box_reg: 0.285 loss_mask: 0.353 loss_rpn_cls: 0.065 loss_rpn_loc: 0.092 time: 0.3772 data_time: 0.0082 lr: 0.010000 max_mem: 3402M
[02/05 23:06:14 d2.utils.events]: eta: 9:08:00 iter: 3699 total_loss: 1.137 loss_cls: 0.329 loss_box_reg: 0.300 loss_mask: 0.348 loss_rpn_cls: 0.068 loss_rpn_loc: 0.081 time: 0.3773 data_time: 0.0083 lr: 0.010000 max_mem: 3402M
[02/05 23:06:22 d2.utils.events]: eta: 9:07:50 iter: 3719 total_loss: 1.175 loss_cls: 0.351 loss_box_reg: 0.297 loss_mask: 0.360 loss_rpn_cls: 0.066 loss_rpn_loc: 0.096 time: 0.3773 data_time: 0.0089 lr: 0.010000 max_mem: 3402M
[02/05 23:06:30 d2.utils.events]: eta: 9:07:50 iter: 3739 total_loss: 1.186 loss_cls: 0.334 loss_box_reg: 0.296 loss_mask: 0.353 loss_rpn_cls: 0.079 loss_rpn_loc: 0.111 time: 0.3774 data_time: 0.0079 lr: 0.010000 max_mem: 3402M
[02/05 23:06:38 d2.utils.events]: eta: 9:08:46 iter: 3759 total_loss: 1.160 loss_cls: 0.359 loss_box_reg: 0.283 loss_mask: 0.355 loss_rpn_cls: 0.070 loss_rpn_loc: 0.081 time: 0.3774 data_time: 0.0084 lr: 0.010000 max_mem: 3402M
[02/05 23:06:45 d2.utils.events]: eta: 9:08:46 iter: 3779 total_loss: 1.285 loss_cls: 0.364 loss_box_reg: 0.303 loss_mask: 0.376 loss_rpn_cls: 0.070 loss_rpn_loc: 0.092 time: 0.3775 data_time: 0.0083 lr: 0.010000 max_mem: 3402M
[02/05 23:06:53 d2.utils.events]: eta: 9:08:20 iter: 3799 total_loss: 1.227 loss_cls: 0.340 loss_box_reg: 0.300 loss_mask: 0.347 loss_rpn_cls: 0.084 loss_rpn_loc: 0.115 time: 0.3775 data_time: 0.0085 lr: 0.010000 max_mem: 3402M
ERROR [02/05 23:07:15 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 220, in run_step
    self._write_metrics(metrics_dict)
  File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 255, in _write_metrics
    all_metrics_dict = comm.gather(metrics_dict)
  File "/podc/src/detectron2/detectron2/utils/comm.py", line 200, in gather
    size_list, tensor = _pad_to_largest_tensor(tensor, group)
  File "/podc/src/detectron2/detectron2/utils/comm.py", line 126, in _pad_to_largest_tensor
    dist.all_gather(size_list, local_size, group=group)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1158, in all_gather
    work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:572] Connection closed by peer [172.19.0.2]:13729
[02/05 23:07:15 d2.engine.hooks]: Overall training speed: 3803 iterations in 0:24:16 (0.3829 s / it)
[02/05 23:07:15 d2.engine.hooks]: Total training time: 0:24:19 (0:00:02 on hooks)
Expected behavior: training runs to completion, as it does when the iteration count stays below ~3750.
Environment:
------------------------ ---------------------------------------------------------
sys.platform linux
Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
numpy 1.18.1
detectron2 0.1 @/podc/src/detectron2/detectron2
detectron2 compiler GCC 7.4
detectron2 CUDA compiler 10.1
detectron2 arch flags sm_60
DETECTRON2_ENV_MODULE <not set>
PyTorch 1.4.0 @/usr/local/lib/python3.6/dist-packages/torch
PyTorch debug build False
CUDA available True
GPU 0,1,2,3 Tesla P100-SXM2-16GB
CUDA_HOME /usr/local/cuda
NVCC Cuda compilation tools, release 10.1, V10.1.243
Pillow 6.2.2
torchvision 0.5.0 @/usr/local/lib/python3.6/dist-packages/torchvision
torchvision arch flags sm_35, sm_50, sm_60, sm_70, sm_75
------------------------ ---------------------------------------------------------
PyTorch built with:
- GCC 7.3
- Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.1
- Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
Top GitHub Comments
Because my dataset is from industry and is very dirty, using `read_image` in detectron2 would cause training to fail, so I changed the `read_image` function from Pillow to OpenCV at https://github.com/facebookresearch/detectron2/blob/master/detectron2/data/detection_utils.py#L36. You can roughly refer to this issue: https://github.com/facebookresearch/detectron2/issues/788. I don't have the same problem as yours.
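Roughly, the OpenCV-based replacement looks like the sketch below (not the exact patch; the `read_image_cv2` name and the handled formats are simplified for illustration):

```python
# Illustrative sketch of an OpenCV-based image loader standing in for
# detectron2's Pillow-based read_image. cv2.imread returns None for files
# it cannot decode, which makes corrupt images easy to detect and report.
import cv2

def read_image_cv2(file_name, format=None):
    img = cv2.imread(file_name, cv2.IMREAD_COLOR)  # uint8 HxWx3, BGR order
    if img is None:
        raise IOError("Failed to decode image: {}".format(file_name))
    if format == "RGB":
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    elif format == "L":
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)[:, :, None]
    return img  # "BGR" (or no format) returns the array as loaded
```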
So, running it with 1 GPU eventually failed because there were a few corrupt JPEGs in the dataset I was using. I re-downloaded the dataset, ran with both 1 GPU and 4 GPUs on 3 different boxes, and didn't get the error again. Also note that I rebuilt my Docker images before running this second test; when installing torchvision 0.5.0, Pillow 7.0.0 was installed instead of 6.2.2. I think the ultimate issue was problems with certain images in my dataset. So, like @roger1993, I'd suggest either cleaning your data (a sketch of a pre-training check for corrupt images is below) or making further updates to the `read_image` function to ensure all images that cause an error are skipped. However, it would be nice to see the default `read_image` function updated so that problematic images are skipped instead of producing an arbitrary `Connection closed by peer` error. Should I close this issue since my issue is solved, or leave it open so that potential work on the `read_image` function can be tied to an issue?
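For what it's worth, a check along these lines could be run before training to find corrupt files (a hypothetical standalone helper, not part of detectron2; the dataset path is a placeholder):

```python
# Hypothetical pre-training check: walk an image directory with Pillow and
# report files it cannot fully decode, so they can be removed or re-fetched
# before distributed training starts.
import os
from PIL import Image

def find_corrupt_images(image_root):
    bad = []
    for root, _, files in os.walk(image_root):
        for name in files:
            if not name.lower().endswith((".jpg", ".jpeg", ".png")):
                continue
            path = os.path.join(root, name)
            try:
                with Image.open(path) as img:
                    img.load()  # force a full decode, not just the header
            except Exception as exc:
                bad.append((path, str(exc)))
    return bad

if __name__ == "__main__":
    for path, err in find_corrupt_images("datasets/mydataset/images"):
        print("corrupt: {} ({})".format(path, err))
```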