Connection closed by peer error after ~4000 iterations
Instructions To Reproduce the Issue:
- what changes you made (`git diff`) or what code you wrote: in `train_net.py` I use `register_coco_instances()` to register my train and validation datasets.
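Roughly, the registration looks like the minimal sketch below (the annotation and image paths are placeholders for my actual data):

```python
# Minimal sketch of the dataset registration in train_net.py.
# The JSON annotation files and image directories are placeholder paths.
from detectron2.data.datasets import register_coco_instances

register_coco_instances(
    "mydataset_train", {},
    "datasets/mydataset/annotations/train.json",
    "datasets/mydataset/images/train",
)
register_coco_instances(
    "mydataset_val", {},
    "datasets/mydataset/annotations/val.json",
    "datasets/mydataset/images/val",
)
```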
- what exact command you run: within a Docker container, I run

```
tools/train_net.py --num-gpus 4 --config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml DATASETS.TRAIN ('mydataset_train',) DATASETS.TEST ('mydataset_val',) SOLVER.IMS_PER_BATCH 8 SOLVER.BASE_LR 0.01 OUTPUT_DIR /home/$USER/shared
```
- what you observed (including the full logs): after ~3750 iterations I keep getting the following error. If I train for fewer than ~3750 iterations, everything finishes successfully.
[02/05 23:05:59 d2.utils.events]: eta: 9:07:04 iter: 3659 total_loss: 1.117 loss_cls: 0.345 loss_box_reg: 0.285 loss_mask: 0.343 loss_rpn_cls: 0.077 loss_rpn_loc: 0.105 time: 0.3771 data_time: 0.0084 lr: 0.010000 max_mem: 3402M
[02/05 23:06:07 d2.utils.events]: eta: 9:07:43 iter: 3679 total_loss: 1.123 loss_cls: 0.355 loss_box_reg: 0.285 loss_mask: 0.353 loss_rpn_cls: 0.065 loss_rpn_loc: 0.092 time: 0.3772 data_time: 0.0082 lr: 0.010000 max_mem: 3402M
[02/05 23:06:14 d2.utils.events]: eta: 9:08:00 iter: 3699 total_loss: 1.137 loss_cls: 0.329 loss_box_reg: 0.300 loss_mask: 0.348 loss_rpn_cls: 0.068 loss_rpn_loc: 0.081 time: 0.3773 data_time: 0.0083 lr: 0.010000 max_mem: 3402M
[02/05 23:06:22 d2.utils.events]: eta: 9:07:50 iter: 3719 total_loss: 1.175 loss_cls: 0.351 loss_box_reg: 0.297 loss_mask: 0.360 loss_rpn_cls: 0.066 loss_rpn_loc: 0.096 time: 0.3773 data_time: 0.0089 lr: 0.010000 max_mem: 3402M
[02/05 23:06:30 d2.utils.events]: eta: 9:07:50 iter: 3739 total_loss: 1.186 loss_cls: 0.334 loss_box_reg: 0.296 loss_mask: 0.353 loss_rpn_cls: 0.079 loss_rpn_loc: 0.111 time: 0.3774 data_time: 0.0079 lr: 0.010000 max_mem: 3402M
[02/05 23:06:38 d2.utils.events]: eta: 9:08:46 iter: 3759 total_loss: 1.160 loss_cls: 0.359 loss_box_reg: 0.283 loss_mask: 0.355 loss_rpn_cls: 0.070 loss_rpn_loc: 0.081 time: 0.3774 data_time: 0.0084 lr: 0.010000 max_mem: 3402M
[02/05 23:06:45 d2.utils.events]: eta: 9:08:46 iter: 3779 total_loss: 1.285 loss_cls: 0.364 loss_box_reg: 0.303 loss_mask: 0.376 loss_rpn_cls: 0.070 loss_rpn_loc: 0.092 time: 0.3775 data_time: 0.0083 lr: 0.010000 max_mem: 3402M
[02/05 23:06:53 d2.utils.events]: eta: 9:08:20 iter: 3799 total_loss: 1.227 loss_cls: 0.340 loss_box_reg: 0.300 loss_mask: 0.347 loss_rpn_cls: 0.084 loss_rpn_loc: 0.115 time: 0.3775 data_time: 0.0085 lr: 0.010000 max_mem: 3402M
ERROR [02/05 23:07:15 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 220, in run_step
    self._write_metrics(metrics_dict)
  File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 255, in _write_metrics
    all_metrics_dict = comm.gather(metrics_dict)
  File "/podc/src/detectron2/detectron2/utils/comm.py", line 200, in gather
    size_list, tensor = _pad_to_largest_tensor(tensor, group)
  File "/podc/src/detectron2/detectron2/utils/comm.py", line 126, in _pad_to_largest_tensor
    dist.all_gather(size_list, local_size, group=group)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1158, in all_gather
    work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:572] Connection closed by peer [172.19.0.2]:13729
[02/05 23:07:15 d2.engine.hooks]: Overall training speed: 3803 iterations in 0:24:16 (0.3829 s / it)
[02/05 23:07:15 d2.engine.hooks]: Total training time: 0:24:19 (0:00:02 on hooks)
Expected behavior: training runs to completion, as it does when the iteration count stays below ~3750.
Environment:
------------------------ ---------------------------------------------------------
sys.platform linux
Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
numpy 1.18.1
detectron2 0.1 @/podc/src/detectron2/detectron2
detectron2 compiler GCC 7.4
detectron2 CUDA compiler 10.1
detectron2 arch flags sm_60
DETECTRON2_ENV_MODULE <not set>
PyTorch 1.4.0 @/usr/local/lib/python3.6/dist-packages/torch
PyTorch debug build False
CUDA available True
GPU 0,1,2,3 Tesla P100-SXM2-16GB
CUDA_HOME /usr/local/cuda
NVCC Cuda compilation tools, release 10.1, V10.1.243
Pillow 6.2.2
torchvision 0.5.0 @/usr/local/lib/python3.6/dist-packages/torchvision
torchvision arch flags sm_35, sm_50, sm_60, sm_70, sm_75
------------------------ ---------------------------------------------------------
PyTorch built with:
- GCC 7.3
- Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.1
- Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
Top GitHub Comments
Because my dataset is from industry and is very dirty, using `read_image` in detectron2 would cause training to fail, so I changed the `read_image` function from Pillow to OpenCV at https://github.com/facebookresearch/detectron2/blob/master/detectron2/data/detection_utils.py#L36. You can roughly refer to this issue: https://github.com/facebookresearch/detectron2/issues/788. I don't have the same problem as yours.
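Roughly, the OpenCV-based replacement looks like the sketch below (not the exact patch; the `read_image_cv2` name and the handled formats are simplified for illustration):

```python
# Illustrative sketch of an OpenCV-based image loader standing in for
# detectron2's Pillow-based read_image. cv2.imread returns None for files
# it cannot decode, which makes corrupt images easy to detect and report.
import cv2

def read_image_cv2(file_name, format=None):
    img = cv2.imread(file_name, cv2.IMREAD_COLOR)  # uint8 HxWx3, BGR order
    if img is None:
        raise IOError("Failed to decode image: {}".format(file_name))
    if format == "RGB":
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    elif format == "L":
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)[:, :, None]
    return img  # "BGR" (or no format) returns the array as loaded
```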
So, running it with 1 GPU eventually failed because there were a few corrupt JPEGs in the dataset I was using. I re-downloaded the dataset, ran with both 1 GPU and 4 GPUs on 3 different boxes, and didn't get the error again. Also note that I rebuilt my Docker images before running this second test; when installing torchvision 0.5.0, Pillow 7.0.0 was installed instead of 6.2.2. I think the ultimate issue was problems with certain images in my dataset. So, like @roger1993, I'd suggest either cleaning your data (a sketch of a pre-training check for corrupt images is below) or making further updates to the `read_image` function to ensure all images that cause an error are skipped. However, it would be nice to see the default `read_image` function updated so that problematic images are skipped instead of producing an arbitrary `Connection closed by peer` error. Should I close this issue since my issue is solved, or leave it open so that potential work on the `read_image` function can be tied to an issue?
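For what it's worth, a check along these lines could be run before training to find corrupt files (a hypothetical standalone helper, not part of detectron2; the dataset path is a placeholder):

```python
# Hypothetical pre-training check: walk an image directory with Pillow and
# report files it cannot fully decode, so they can be removed or re-fetched
# before distributed training starts.
import os
from PIL import Image

def find_corrupt_images(image_root):
    bad = []
    for root, _, files in os.walk(image_root):
        for name in files:
            if not name.lower().endswith((".jpg", ".jpeg", ".png")):
                continue
            path = os.path.join(root, name)
            try:
                with Image.open(path) as img:
                    img.load()  # force a full decode, not just the header
            except Exception as exc:
                bad.append((path, str(exc)))
    return bad

if __name__ == "__main__":
    for path, err in find_corrupt_images("datasets/mydataset/images"):
        print("corrupt: {} ({})".format(path, err))
```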