no training: Total training time: 0:00:00


I debugged the training process. The dataset loads without problems, but when execution reaches line 205 of /home/zhaojing/BCNet/detectron2/engine/train_loop.py, training ends: data = next(self._data_loader_iter)

How can I solve this problem? Thank you!

The log is as follows:
…
[07/15 18:54:16 d2.data.datasets.coco]: Loading /repository01/nucleus_seg_data/instanceSeg/MoNuSeg/annotations/instances_train2017bcnet.json takes 16.16 seconds.
[07/15 18:54:17 d2.data.datasets.coco]: Loaded 7680 images in COCO format from /repository01/nucleus_seg_data/instanceSeg/MoNuSeg/annotations/instances_train2017bcnet.json
[07/15 19:00:58 d2.data.build]: Removed 0 images with no usable annotations. 7680 images left.
[07/15 19:01:12 d2.data.build]: Distribution of instances among all 1 categories:

category #instances
building 310262

[07/15 19:01:16 d2.data.common]: Serializing 7680 elements to byte tensors and concatenating them all …
[07/15 19:01:20 d2.data.common]: Serialized dataset takes 253.14 MiB
[07/15 19:02:52 d2.data.detection_utils]: TransformGens used in training: [ResizeShortestEdge(short_edge_length=(600,), max_size=900, sample_style='choice'), RandomFlip()]
[07/15 19:03:31 d2.data.build]: Using training sampler TrainingSampler
[07/15 19:05:52 fvcore.common.checkpoint]: [Checkpointer] Loading from /home/zhaojing/BCNet/pretrainmodel/R-101.pkl …
[07/15 19:05:52 d2.checkpoint.c2_model_loading]: Remapping C2 weights …
[07/15 19:05:54 d2.checkpoint.c2_model_loading]: backbone.bottom_up.res2.0.conv1.norm.bias loaded from res2_0_branch2a_bn_beta of shape (64,)
[07/15 19:05:54 d2.checkpoint.c2_model_loading]: backbone.bottom_up.res2.0.conv1.norm.running_mean loaded from res2_0_branch2a_bn_running_mean of shape (64,)
[07/15 19:05:54 d2.checkpoint.c2_model_loading]: backbone.bottom_up.res2.0.conv1.norm.running_var loaded from res2_0_branch2a_bn_running_var of shape (64,)
…

[07/15 16:52:25 d2.checkpoint.c2_model_loading]: The checkpoint state_dict contains keys that are not used by the model: fc1000_b fc1000_w
[07/15 16:52:25 d2.engine.train_loop]: Starting training from iteration 0
[07/15 16:52:25 d2.engine.hooks]: Total training time: 0:00:00 (0:00:00 on hooks)
[07/15 16:52:26 d2.data.datasets.coco]: Loaded 896 images in COCO format from /repository01/nucleus_seg_data/instanceSeg/MoNuSeg/annotations/instances_test2017bcnet.json
[07/15 16:52:26 d2.data.build]: Distribution of instances among all 1 categories:

category #instances
building 24384

[07/15 16:52:26 d2.data.common]: Serializing 896 elements to byte tensors and concatenating them all …
[07/15 16:52:26 d2.data.common]: Serialized dataset takes 7.96 MiB
[07/15 16:52:27 d2.evaluation.evaluator]: Start inference on 896 images
[07/15 16:52:28 d2.evaluation.evaluator]: Inference done 11/896. 0.0775 s / img. ETA=0:01:30
[07/15 16:52:33 d2.evaluation.evaluator]: Inference done 61/896. 0.0761 s / img. ETA=0:01:24


_libgcc_mutex             0.1                        main  
_openmp_mutex             4.5                       1_gnu  
blas                      1.0                         mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
bzip2                     1.0.8                h7b6447c_0  
ca-certificates           2021.7.5             h06a4308_1  
certifi                   2021.5.30        py37h06a4308_0  
cffi                      1.14.6           py37h400218f_0  
charset-normalizer        2.0.1                    pypi_0    pypi
cloudpickle               1.6.0                    pypi_0    pypi
cudatoolkit               10.0.130                      0    nvidia
cycler                    0.10.0                   pypi_0    pypi
cython                    3.0.0a8                  pypi_0    pypi
detectron2                0.1                       dev_0    <develop>
ffmpeg                    4.3                  hf484d3e_0    pytorch
freetype                  2.10.4               h5ab3b9f_0  
future                    0.18.2                   pypi_0    pypi
gmp                       6.2.1                h2531618_2  
gnutls                    3.6.15               he1e5248_0  
idna                      3.2                      pypi_0    pypi
intel-openmp              2021.2.0           h06a4308_610  
iopath                    0.1.9                    pypi_0    pypi
jpeg                      9b                            0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
kiwisolver                1.3.1                    pypi_0    pypi
lame                      3.100                h7b6447c_0  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.35.1               h7274673_9  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.3.0               h5101ec6_17  
libgfortran-ng            7.5.0               ha8ba4b0_17  
libgfortran4              7.5.0               ha8ba4b0_17  
libgomp                   9.3.0               h5101ec6_17  
libiconv                  1.14                          0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
libidn2                   2.3.1                h27cfd23_0  
libpng                    1.6.37               hbc83047_0  
libstdcxx-ng              9.3.0               hd4cf53a_17  
libtasn1                  4.16.0               h27cfd23_0  
libtiff                   4.2.0                h85742a9_0  
libunistring              0.9.10               h27cfd23_0  
libuv                     1.40.0               h7b6447c_0  
libwebp-base              1.2.0                h27cfd23_0  
lz4-c                     1.9.3                h2531618_0  
mkl                       2021.2.0           h06a4308_296  
mkl-service               2.4.0            py37h7f8727e_0  
mkl_fft                   1.3.0            py37h42c9631_2  
mkl_random                1.2.1            py37ha9443f7_2  
ncurses                   6.2                  he6710b0_1  
nettle                    3.7.3                hbbd107a_1  
ninja                     1.7.2                         0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
numpy                     1.20.2           py37h2d18471_0  
numpy-base                1.20.2           py37hfae3a4d_0  
olefile                   0.46                     py37_0  
openh264                  2.1.0                hd408876_0  
openjpeg                  2.3.0                h05c96fa_1  
openssl                   1.1.1k               h27cfd23_0  
pillow                    6.2.2                    pypi_0    pypi
pip                       21.1.3           py37h06a4308_0  
portalocker               2.3.0                    pypi_0    pypi
pycocotools               2.0                      pypi_0    pypi
pycparser                 2.20                       py_2  
pydot                     1.4.2                    pypi_0    pypi
pyparsing                 3.0.0b2                  pypi_0    pypi
python                    3.7.10               h12debd9_4  
python-dateutil           2.8.1                    pypi_0    pypi
pytorch                   1.4.0           py3.7_cuda10.0.130_cudnn7.6.3_0    pytorch
pywavelets                1.1.1            py37h7b6447c_2  
pyyaml                    5.4.1                    pypi_0    pypi
readline                  8.1                  h27cfd23_0  
requests                  2.26.0                   pypi_0    pypi
scipy                     1.6.2            py37had2a1c9_1  
setuptools                52.0.0           py37h06a4308_0  
six                       1.16.0             pyhd3eb1b0_0  
sqlite                    3.36.0               hc218d9a_0  
tabulate                  0.8.9                    pypi_0    pypi
tensorboard               2.5.0                    pypi_0    pypi
tensorboard-data-server   0.6.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.0                    pypi_0    pypi
tk                        8.6.10               hbc83047_0  
torchaudio                0.4.0                      py37    pytorch
torchvision               0.5.0                py37_cu100    pytorch
tqdm                      4.61.2                   pypi_0    pypi
typing                    3.10.0.0         py37h06a4308_0  
typing_extensions         3.10.0.0           pyh06a4308_0  
urllib3                   1.26.6                   pypi_0    pypi
werkzeug                  2.0.1                    pypi_0    pypi
wheel                     0.36.2                   pypi_0    pypi
xz                        5.2.5                h7b6447c_0  
yacs                      0.1.8                    pypi_0    pypi
zlib                      1.2.11                        0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
zstd                      1.4.9                haebb681_0

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

jingzhaohlj commented, Jul 16, 2021 (2 reactions)

I inserted the following code in BCNet/detectron2/engine/train_loop.py, and the problem was solved.

torch.backends.cudnn.enabled = False
loss_dict = self.model(data, self.iter, self.max_iter)
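
The fix above switches cuDNN off globally for the rest of the process. As a minimal sketch of a narrower variant, the snippet below disables cuDNN only around a single forward call using the `torch.backends.cudnn.flags` context manager. The helper name `run_without_cudnn` and the small `Conv2d` stand-in model are assumptions made for illustration; they are not part of BCNet or detectron2.

```python
import torch


def run_without_cudnn(model, *args, **kwargs):
    """Call `model` with cuDNN disabled only for the duration of this call.

    Hypothetical helper for illustration; not part of BCNet or detectron2.
    """
    with torch.backends.cudnn.flags(enabled=False):
        return model(*args, **kwargs)


# Stand-in model and input (assumptions); in the issue the call would be
# self.model(data, self.iter, self.max_iter).
conv = torch.nn.Conv2d(3, 4, kernel_size=3, padding=1)
x = torch.randn(1, 3, 8, 8)

previous_setting = torch.backends.cudnn.enabled
out = run_without_cudnn(conv, x)

# The global setting is restored once the context manager exits.
restored_setting = torch.backends.cudnn.enabled
```

Disabling cuDNN usually makes convolutions slower, so this is probably best treated as a workaround rather than a permanent setting.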

chenxinfeng4 commented, Sep 26, 2021 (1 reaction)

I had the same bug on Windows 10. I found that it is caused by PyTorch 1.4, and it has been reported in the PyTorch GitHub repository before.

CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.

One solution is to upgrade to PyTorch 1.8, along with torchvision. BCNet has many compatibility issues with PyTorch 1.8, but after step-by-step debugging I finally got it to work.
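
The quoted error message mentions non-contiguous inputs. As a small, hedged illustration of what that means in PyTorch (not a confirmed fix for this issue), the sketch below shows how a `permute` produces a non-contiguous view and how `.contiguous()` copies it into a contiguous layout. The tensor shapes are arbitrary and chosen only for the example.

```python
import torch

# A contiguous tensor with an arbitrary example shape.
x = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)

# permute returns a view with reordered strides; the memory is not rearranged,
# so the result is no longer contiguous.
permuted = x.permute(0, 2, 1)
was_contiguous = permuted.is_contiguous()

# .contiguous() copies the data into a fresh, contiguous block of memory.
fixed = permuted.contiguous()
is_contiguous_now = fixed.is_contiguous()
```

Calling `tensor.is_contiguous()` on the inputs just before the failing layer is one way to check whether this condition applies.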
