Training gets stuck with hrnet: problems while loading data?
First of all, thank you for this very good repository! I already ran a training successfully on a binary custom dataset and got very good results in evaluation and visualization, using the hrnetv2 + c1 model. Now I am trying to train on another custom dataset with 34 classes, including an "undefined" class labeled as 0 so that it is ignored (see the sketch after the log below for how this convention is typically handled). Everything starts as usual, but training seems to block while creating the iterator over the training dataset. Here is the output of my program up to the point where it stops and remains stuck:
```
Training started on Sat Apr 4 16:37:42 CEST 2020
[2020-04-04 16:37:43,581 INFO train.py line 243 28431] Loaded configuration file ./config/customdataset-hrnetv2-c1.yaml
[2020-04-04 16:37:43,582 INFO train.py line 244 28431] Running with config:
DATASET:
  imgMaxSize: 4000
  imgSizes: (254, 267, 300, 350, 363, 372, 396, 400, 410, 420, 421, 425, 426, 429, 436, 440, 441, 456, 466, 467, 480, 496, 498, 500, 506, 525, 531, 538, 549, 559, 600, 605, 639, 640, 654, 662, 664, 680, 702, 714, 720, 750, 751, 768, 800, 808, 843, 860, 873, 900, 938, 954, 957, 960, 1000, 1015, 1024, 1025, 1080, 1087, 1102, 1118, 1200, 1283, 1333, 1390, 1600, 1789, 2000, 2247, 2332, 2400, 3000, 3079, 3264)
  list_train: data/customdataset/training.odgt
  list_val: data/customdataset/validation.odgt
  num_class: 34
  padding_constant: 32
  random_flip: True
  root_dataset: data/customdataset/
  segm_downsampling_rate: 4
DIR: ckpt/customdataset-hrnetv2-c1
MODEL:
  arch_decoder: c1
  arch_encoder: hrnetv2
  fc_dim: 720
  weights_decoder:
  weights_encoder:
TEST:
  batch_size: 1
  checkpoint: epoch_4.pth
  result: ./result/customdataset/exp01
TRAIN:
  batch_size_per_gpu: 2
  beta1: 0.9
  deep_sup_scale: 0.4
  disp_iter: 1
  epoch_iters: 5000
  fix_bn: False
  lr_decoder: 0.02
  lr_encoder: 0.02
  lr_pow: 0.9
  num_epoch: 4
  optim: SGD
  seed: 304
  start_epoch: 0
  weight_decay: 0.0001
  workers: 16
VAL:
  batch_size: 1
  checkpoint: epoch_4.pth
  visualize: True
[2020-04-04 16:37:43,582 INFO train.py line 249 28431] Outputing checkpoints to: ckpt/customdataset-hrnetv2-c1
# samples: 80
1 Epoch = 5000 iters
```
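As an aside on the label-0 convention mentioned above: ADE20K-style pipelines typically shift the labels by one so that 0 becomes a sentinel value the loss skips. Below is a toy sketch of that mechanism, assuming PyTorch's standard `ignore_index`; all tensors are fake data for illustration, and I have not verified this is exactly what this repo's dataset code does.

```python
import torch
import torch.nn as nn

# Toy sketch: labels stored as 0 = "undefined", 1..34 = real classes.
# Shifting by -1 maps "undefined" to -1, which NLLLoss then ignores.
num_class = 34
raw_labels = torch.randint(0, num_class + 1, (2, 64, 64))  # fake label map
target = raw_labels.long() - 1                             # 0 -> -1 (ignored)
criterion = nn.NLLLoss(ignore_index=-1)
log_probs = torch.randn(2, num_class, 64, 64).log_softmax(dim=1)
loss = criterion(log_probs, target)
print(loss.item())
```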
The pretrained model I am using is ade20k-hrnetv2-c1, the same as in the previous experiments.
I put some trivial print() calls in train.py after each step that follows the last message actually printed. It seems there is a problem in creating the iterator over the training data:
```python
# train.py, around lines 178-182
print('1 Epoch = {} iters'.format(cfg.TRAIN.epoch_iters))  # this appears

# create loader iterator
iterator_train = iter(loader_train)
print('Iterator train created')  # this does not appear
```
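One way to check whether the hang comes from DataLoader worker processes rather than from the data itself is to rebuild the loader single-process. A debugging sketch, assuming a standard torch.utils.data.Dataset instance named `dataset_train`; the real loader in train.py uses its own batch and collate settings, so treat the names and arguments here as assumptions:

```python
from torch.utils.data import DataLoader

# Debugging sketch: single-process loading. If next() returns a batch
# here, the stall is in the worker processes (e.g. shared-memory or
# fork issues with num_workers=16), not in the dataset itself.
debug_loader = DataLoader(
    dataset_train,   # assumed: the training Dataset built earlier in train.py
    batch_size=1,
    shuffle=False,
    num_workers=0,   # no worker processes
)
batch = next(iter(debug_loader))
print('Got a batch:', type(batch))
```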
The GPU memory usage seems to confirm this guess. With nvidia-smi I am monitoring two NVIDIA GPUs, a Pascal Quadro P6000 and a Titan RTX, both with 24 GB of memory (I understand it may not be ideal to mix different architectures?). In the previous trainings everything worked, with both GPUs at roughly 75% memory usage, evenly distributed as I expected. Now, instead, after an initial increase in memory usage, training gets stuck with very unbalanced and low memory usage.
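Besides nvidia-smi, the per-process view from inside the script can help distinguish a stalled loader from a memory problem. A small sketch using PyTorch's standard torch.cuda counters (nothing repo-specific):

```python
import torch

# Print what this process has allocated/reserved on each visible GPU.
# nvidia-smi shows totals across all processes; these counters are
# per-process, so a stuck-but-alive training shows stable low numbers.
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    alloc = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f'GPU {i} ({name}): {alloc:.2f} GiB allocated, '
          f'{reserved:.2f} GiB reserved')
```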
What am I doing wrong? Any help is appreciated.
Top GitHub Comments
I guess it is a memory issue. Can you try to reduce the batch size?
Yes, thank you. I set the batch size per GPU to 2 and it is working now.
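For anyone landing here: the relevant setting is TRAIN.batch_size_per_gpu in the YAML config shown above. A minimal sketch of lowering it via a yacs-style override, since this repository's config system is built on yacs; the initial value of 8 below is made up for illustration:

```python
from yacs.config import CfgNode as CN

# Toy config mirroring the TRAIN section of the dump above.
cfg = CN()
cfg.TRAIN = CN()
cfg.TRAIN.batch_size_per_gpu = 8   # hypothetical original value
cfg.TRAIN.workers = 16

# Lower the per-GPU batch size without editing the YAML file; this is
# what command-line "KEY VALUE" overrides do in yacs-based training scripts.
cfg.merge_from_list(['TRAIN.batch_size_per_gpu', 2])
assert cfg.TRAIN.batch_size_per_gpu == 2
```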