
HELP! Error during training LIDC dataset

See original GitHub issue

Hello Sir Paul, I have already converted the LIDC database, but after I run python exec.py --mode train --exp_source experiments/lidc_exp/ --exp_dir LIDC-Retina-model, training gets stuck (it shows "starting validation") on fold 1. Note: I changed num_epoch to 50 and num_trainbatches to 10, since I am only using a 10-sample dataset.

CLI message:

starting training epoch 50
tr. batch 1/10 (ep. 50) fw 2.251s / bw 0.743s / total 2.993s || loss: 1.03, class: 0.89, bbox: 0.14
tr. batch 2/10 (ep. 50) fw 2.532s / bw 0.744s / total 3.276s || loss: 0.89, class: 0.66, bbox: 0.23
tr. batch 3/10 (ep. 50) fw 2.392s / bw 0.742s / total 3.134s || loss: 0.74, class: 0.73, bbox: 0.01
tr. batch 4/10 (ep. 50) fw 2.535s / bw 0.517s / total 3.053s || loss: 0.47, class: 0.47, bbox: 0.00
tr. batch 5/10 (ep. 50) fw 3.106s / bw 0.744s / total 3.850s || loss: 0.78, class: 0.71, bbox: 0.08
tr. batch 6/10 (ep. 50) fw 2.920s / bw 0.742s / total 3.662s || loss: 0.52, class: 0.49, bbox: 0.03
tr. batch 7/10 (ep. 50) fw 2.220s / bw 0.747s / total 2.967s || loss: 0.67, class: 0.56, bbox: 0.11
tr. batch 8/10 (ep. 50) fw 2.164s / bw 0.758s / total 2.921s || loss: 0.57, class: 0.51, bbox: 0.06
tr. batch 9/10 (ep. 50) fw 2.333s / bw 0.750s / total 3.082s || loss: 0.80, class: 0.70, bbox: 0.10
tr. batch 10/10 (ep. 50) fw 2.390s / bw 0.760s / total 3.150s || loss: 0.70, class: 0.66, bbox: 0.03
evaluating in mode train
evaluating with match_iou: 0.1
starting validation in mode val_sampling.
evaluating in mode val_sampling
evaluating with match_iou: 0.1
non none scores: [0.00000000e+00 0.00000000e+00 0.00000000e+00 1.33691776e-04 1.12577370e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 3.19541394e-06 0.00000000e+00 0.00000000e+00 0.00000000e+00 6.34394073e-05 3.46760788e-04 0.00000000e+00 6.57964466e-05 6.30265885e-06 1.83419772e-04 0.00000000e+00 0.00000000e+00 3.13401814e-05 0.00000000e+00 0.00000000e+00 8.20894272e-05 4.21034540e-06 1.00719716e-03 7.65382661e-07 1.39219383e-05 7.98896203e-04 0.00000000e+00 2.30329873e-04 2.08085640e-04 1.10898187e-06 0.00000000e+00 0.00000000e+00 1.11219310e-05 1.91517091e-04 1.70706726e-04 1.07269665e-06 0.00000000e+00 0.00000000e+00 4.47997328e-05 0.00000000e+00 1.04838946e-06 1.86664529e-03 5.89871320e-06 1.97787268e-04]
trained epoch 50: took 212.29711294174194 sec. (41.897600412368774 train / 170.39951252937317 val)
plotting predictions from validation sampling.
starting testing model of fold 0 in exp LIDC-Retina-TrainTest
feature map shapes: [[32 32 64] [16 16 32] [ 8  8 16] [ 4  4  8]]
anchor scales: {'z': [[2, 2.5198420997897464, 3.1748021039363987], [4, 5.039684199579493, 6.3496042078727974], [8, 10.079368399158986, 12.699208415745595], [16, 20.15873679831797, 25.39841683149119]], 'xy': [[8, 10.079368399158986, 12.699208415745595], [16, 20.15873679831797, 25.39841683149119], [32, 40.31747359663594, 50.79683366298238], [64, 80.63494719327188, 101.59366732596476]]}
level 0: built anchors (589824, 6) / expected anchors 589824 ||| total build (589824, 6) / total expected 673920
level 1: built anchors (73728, 6) / expected anchors 73728 ||| total build (663552, 6) / total expected 673920
level 2: built anchors (9216, 6) / expected anchors 9216 ||| total build (672768, 6) / total expected 673920
level 3: built anchors (1152, 6) / expected anchors 1152 ||| total build (673920, 6) / total expected 673920
using default pytorch weight init
subset: selected 2 instances from df
data set loaded with: 2 test patients
tmp ensembling over rank_ix:0 epoch:LIDC-Retina-TrainTest/fold_0/48_best_params.pth
evaluating patient 0009a for fold 0
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)  [logged 4x]
evaluating patient 0003a for fold 0
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)  [logged 4x]
[the same evaluating/forwarding pattern repeats for rank_ix:1 (29_best_params.pth), rank_ix:2 (32_best_params.pth), rank_ix:3 (17_best_params.pth) and rank_ix:4 (34_best_params.pth)]
finished predicting test set.
starting post-processing of predictions.
applying wcs to test set predictions with iou = 1e-05 and n_ens = 20.
applying 2Dto3D merging to test set predictions with iou = 0.1.
evaluating in mode test
evaluating with match_iou: 0.1
/home/ivan/.virtualenvs/virtual-py3/lib/python3.5/site-packages/numpy/core/fromnumeric.py:2920: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/home/ivan/.virtualenvs/virtual-py3/lib/python3.5/site-packages/numpy/core/_methods.py:85: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/home/ivan/.virtualenvs/virtual-py3/lib/python3.5/site-packages/matplotlib/axes/_base.py:3364: UserWarning: Attempting to set identical bottom==top results in singular transformations; automatically expanding. bottom=1.0, top=1.0
  self.set_ylim(upper, lower, auto=None)
Logging to LIDC-Retina-TrainTest/fold_1/exec.log
performing training in 3D over fold 1 on experiment LIDC-Retina-TrainTest with model retina_net
feature map shapes: [[32 32 64] [16 16 32] [ 8  8 16] [ 4  4  8]]
anchor scales: {'z': [[2, 2.5198420997897464, 3.1748021039363987], [4, 5.039684199579493, 6.3496042078727974], [8, 10.079368399158986, 12.699208415745595], [16, 20.15873679831797, 25.39841683149119]], 'xy': [[8, 10.079368399158986, 12.699208415745595], [16, 20.15873679831797, 25.39841683149119], [32, 40.31747359663594, 50.79683366298238], [64, 80.63494719327188, 101.59366732596476]]}
level 0: built anchors (589824, 6) / expected anchors 589824 ||| total build (589824, 6) / total expected 673920
level 1: built anchors (73728, 6) / expected anchors 73728 ||| total build (663552, 6) / total expected 673920
level 2: built anchors (9216, 6) / expected anchors 9216 ||| total build (672768, 6) / total expected 673920
level 3: built anchors (1152, 6) / expected anchors 1152 ||| total build (673920, 6) / total expected 673920
using default pytorch weight init
loading dataset and initializing batch generators...
data set loaded with: 6 train / 2 val / 2 test patients
starting training epoch 1
tr. batch 1/10 (ep. 1) fw 1.901s / bw 0.557s / total 2.458s || loss: 0.55, class: 0.55, bbox: 0.00
tr. batch 2/10 (ep. 1) fw 2.057s / bw 0.777s / total 2.834s || loss: 0.77, class: 0.69, bbox: 0.08
tr. batch 3/10 (ep. 1) fw 1.838s / bw 0.515s / total 2.353s || loss: 0.77, class: 0.77, bbox: 0.00
tr. batch 4/10 (ep. 1) fw 1.803s / bw 0.741s / total 2.544s || loss: 0.94, class: 0.83, bbox: 0.11
tr. batch 5/10 (ep. 1) fw 1.717s / bw 0.741s / total 2.458s || loss: 0.85, class: 0.76, bbox: 0.09
tr. batch 6/10 (ep. 1) fw 1.654s / bw 0.744s / total 2.398s || loss: 1.07, class: 0.90, bbox: 0.17
tr. batch 7/10 (ep. 1) fw 2.217s / bw 0.742s / total 2.959s || loss: 0.80, class: 0.69, bbox: 0.11
tr. batch 8/10 (ep. 1) fw 1.733s / bw 0.740s / total 2.473s || loss: 0.80, class: 0.69, bbox: 0.12
tr. batch 9/10 (ep. 1) fw 1.709s / bw 0.750s / total 2.459s || loss: 1.07, class: 0.89, bbox: 0.18
tr. batch 10/10 (ep. 1) fw 2.189s / bw 0.743s / total 2.932s || loss: 1.06, class: 0.89, bbox: 0.17
evaluating in mode train
evaluating with match_iou: 0.1
starting validation in mode val_sampling.

It has been stuck at "starting validation" for more than 4 hours. Please help me. Thank you in advance, Sir.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
pfjaeger commented, Apr 10, 2019

It’s line 225 …

On 10. Apr 2019, at 16:01, Ivan William H. <notifications@github.com> wrote:

Thank you for your answer, Sir. As for why I used only 10 samples and 10 epochs: first, I want to combine my own private CT-scan dataset with this architecture, but before that I want to check what kind of output the RetinaNet3D architecture produces. Second, my institution gave me permission to use an NVIDIA Tesla P100 16GB, so I thought that for 10 images I would only need 10 epochs.

Could you give me some hints about line 255 of the dataloader?

        # if set to not None, add neighbouring slices to each selected slice in channel dimension.
        if self.cf.n_3D_context is not None:
            padded_data = dutils.pad_nd_image(data[0], [(data.shape[-1] + (self.cf.n_3D_context*2))], mode='constant')
            padded_slice_id = slice_id + self.cf.n_3D_context
            data = (np.concatenate([padded_data[..., ii][np.newaxis] for ii in range(
                padded_slice_id - self.cf.n_3D_context, padded_slice_id + self.cf.n_3D_context + 1)], axis=0))
        else:
            data = data[..., slice_id]
        seg = seg[..., slice_id]
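For context, the neighbouring-slice logic in that snippet can be sketched with plain NumPy, using np.pad in place of the toolkit's dutils.pad_nd_image; the toy array shape, n_3D_context value, and slice_id below are assumptions for illustration only:

```python
import numpy as np

# Toy volume: 1 channel, 8x8 in-plane, 5 slices along the last axis.
data = np.random.rand(1, 8, 8, 5).astype(np.float32)
n_3D_context = 1   # neighbouring slices on each side of the selected slice
slice_id = 0       # an edge slice, which is exactly the case padding handles

# Zero-pad the slice axis so edge slices still have n_3D_context
# neighbours on both sides.
padded = np.pad(data[0], [(0, 0), (0, 0), (n_3D_context, n_3D_context)],
                mode='constant')
padded_slice_id = slice_id + n_3D_context

# Stack the selected slice and its neighbours along a new channel axis.
stacked = np.concatenate(
    [padded[..., ii][np.newaxis] for ii in range(
        padded_slice_id - n_3D_context, padded_slice_id + n_3D_context + 1)],
    axis=0)

print(stacked.shape)  # (3, 8, 8): 2 * n_3D_context + 1 channels
```

For slice_id = 0 the first stacked channel comes entirely from the zero padding, which is what the padding is there to guarantee for slices at the volume edge.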

Thank you Sir

— You are receiving this because you commented. Reply to this email directly, view it on GitHub: https://github.com/pfjaeger/medicaldetectiontoolkit/issues/35#issuecomment-481703532, or mute the thread: https://github.com/notifications/unsubscribe-auth/AVQq7U4-h2z6qIiF2_JNFIMaWmLxGN_tks5vfe6jgaJpZM4cnCcY.

0 reactions
ivanwilliammd commented, Apr 11, 2019

I'm sorry, Sir, but may I ask: does this also affect manual --folds commands?

python exec.py --mode train_test --folds 0 --exp_source experiments/lidc_exp/ --exp_dir LIDC-Retina-TrainTest works smoothly but

python exec.py --mode train_test --folds 1 --exp_source experiments/lidc_exp/ --exp_dir LIDC-Retina-TrainTest
python exec.py --mode train_test --folds 2 --exp_source experiments/lidc_exp/ --exp_dir LIDC-Retina-TrainTest
python exec.py --mode train_test --folds 3 --exp_source experiments/lidc_exp/ --exp_dir LIDC-Retina-TrainTest
python exec.py --mode train_test --folds 4 --exp_source experiments/lidc_exp/ --exp_dir LIDC-Retina-TrainTest

get stuck after the first epoch. @pfjaeger And to solve this error, is it right that I just need to increase the number of patients in the dataset first? Thank you, Sir.
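As a side note, the per-fold commands above can be queued in a small shell loop instead of being launched by hand. This is only a sketch: the dry-run echo and the per-fold log file names are assumptions, not something from the thread.

```shell
#!/bin/sh
# Queue the remaining folds one after another.
for fold in 1 2 3 4; do
    cmd="python exec.py --mode train_test --folds $fold --exp_source experiments/lidc_exp/ --exp_dir LIDC-Retina-TrainTest"
    echo "fold $fold: $cmd"
    # Uncomment to actually run each fold and keep a per-fold log:
    # $cmd 2>&1 | tee "fold_${fold}.log"
done
```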
