Training gets stuck with hrnet: problems while loading data?

First of all, thank you for this very good repository! I have already successfully trained on a custom binary dataset and got very good results in evaluation and visualization, using the hrnet+c1 model. Now I am trying to train on another custom dataset with 34 classes, including “undefined” (labeled 0 so that it is ignored). Everything starts as usual, but the program seems to block while creating the iterator over the training dataset. Here is its output up to the point where it gets stuck:

Training started on sáb abr  4 16:37:42 CEST 2020
[2020-04-04 16:37:43,581 INFO train.py line 243 28431] Loaded configuration file ./config/customdataset-hrnetv2-c1.yaml
[2020-04-04 16:37:43,582 INFO train.py line 244 28431] Running with config:
DATASET:
  imgMaxSize: 4000
  imgSizes: (254, 267, 300, 350, 363, 372, 396, 400, 410, 420, 421, 425, 426, 429, 436, 440, 441, 456, 466, 467, 480, 496, 498, 500, 506, 525, 531, 538, 549, 559, 600, 605, 639, 640, 654, 662, 664, 680, 702, 714, 720, 750, 751, 768, 800, 808, 843, 860, 873, 900, 938, 954, 957, 960, 1000, 1015, 1024, 1025, 1080, 1087, 1102, 1118, 1200, 1283, 1333, 1390, 1600, 1789, 2000, 2247, 2332, 2400, 3000, 3079, 3264)
  list_train: data/customdataset/training.odgt
  list_val: data/customdataset/validation.odgt
  num_class: 34
  padding_constant: 32
  random_flip: True
  root_dataset: data/customdataset/
  segm_downsampling_rate: 4
DIR: ckpt/customdataset-hrnetv2-c1
MODEL:
  arch_decoder: c1
  arch_encoder: hrnetv2
  fc_dim: 720
  weights_decoder: 
  weights_encoder: 
TEST:
  batch_size: 1
  checkpoint: epoch_4.pth
  result: ./result/customdataset/exp01
TRAIN:
  batch_size_per_gpu: 2
  beta1: 0.9
  deep_sup_scale: 0.4
  disp_iter: 1
  epoch_iters: 5000
  fix_bn: False
  lr_decoder: 0.02
  lr_encoder: 0.02
  lr_pow: 0.9
  num_epoch: 4
  optim: SGD
  seed: 304
  start_epoch: 0
  weight_decay: 0.0001
  workers: 16
VAL:
  batch_size: 1
  checkpoint: epoch_4.pth
  visualize: True
[2020-04-04 16:37:43,582 INFO train.py line 249 28431] Outputing checkpoints to: ckpt/customdataset-hrnetv2-c1
# samples: 80
1 Epoch = 5000 iters

The pretrained model I am using is ade20k-hrnetv2-c1, the same as in the previous experiments. I put some trivial print() calls in train.py after each step that follows the last message actually printed. The problem seems to be in creating the iterator over the training data:

print('1 Epoch = {} iters'.format(cfg.TRAIN.epoch_iters)) # this appears

# create loader iterator
iterator_train = iter(loader_train)
print('Iterator train created') # this does not appear
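
Print statements show which call never returns, but not what it is waiting on. As a minimal sketch (not part of the repository), Python's standard faulthandler module can dump the stack of every thread when a block runs longer than expected. The helper name hang_watchdog and the 60-second timeout are hypothetical choices for illustration:

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(seconds, file=None):
    """Dump all thread stacks to `file` every `seconds` while the body is still running.

    Hypothetical helper: it only reports where the program is waiting,
    it does not interrupt or fix the hang.
    """
    faulthandler.dump_traceback_later(seconds, repeat=True, file=file or sys.stderr)
    try:
        yield
    finally:
        # Stop the watchdog as soon as the body finishes (or raises).
        faulthandler.cancel_dump_traceback_later()


# Intended use inside train.py (loader_train comes from the original script):
#
# with hang_watchdog(60):
#     iterator_train = iter(loader_train)
```

If the dumped stack ends inside worker start-up, temporarily setting TRAIN.workers to 0 so that data is loaded in the main process is one way to check whether the worker processes are involved.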

The GPU memory usage seems to confirm this guess. I am monitoring 2 Nvidia GPUs with nvidia-smi, a Pascal Quadro P6000 and a Titan RTX, both with 24GB of memory (I understand that mixing different architectures may not be ideal?). In the previous trainings, everything worked with both GPUs at roughly 75% memory usage, evenly distributed as I expected. Now, instead, after an initial increase in memory usage, it gets stuck with very unbalanced and low memory usage.

What am I doing wrong? Any help is appreciated.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
hangzhaomit commented, Jul 9, 2020

I guess it is a memory issue. Can you try to reduce the batch size?

0 reactions
Mary-h86 commented, Jul 9, 2020

I guess it is a memory issue. Can you try to reduce the batch size?

Yes, thank you. I set the batch size per GPU to 2 and it is working now.
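
To get a feel for why batch size matters with the configuration shown in the question, here is a rough back-of-the-envelope sketch. It assumes a worst case in which an image is resized so that its shorter side is the largest imgSizes entry (3264) and its longer side reaches imgMaxSize (4000), with both rounded up to padding_constant. How the repository actually resizes images is an assumption here, and the figure covers only the raw float32 input tensor; activations inside the network take considerably more.

```python
def round_up(x, multiple):
    """Round x up to the nearest multiple (mirrors the role of padding_constant)."""
    return ((x + multiple - 1) // multiple) * multiple


def input_batch_bytes(short_side, max_side, batch_size,
                      padding_constant=32, channels=3, bytes_per_value=4):
    """Bytes needed for one float32 input batch of the assumed worst-case size."""
    h = round_up(short_side, padding_constant)
    w = round_up(max_side, padding_constant)
    return batch_size * channels * h * w * bytes_per_value


# Values taken from the config in the question: largest imgSizes entry and imgMaxSize.
for batch_size in (2, 1):
    mib = input_batch_bytes(3264, 4000, batch_size) / 2**20
    print(f"batch_size_per_gpu={batch_size}: about {mib:.0f} MiB of input per GPU")
```

The input size scales linearly with the batch size, so halving the batch halves this figure; capping imgSizes or imgMaxSize would reduce it further.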
