
Training on the CMU dataset gets stuck on batch 1


Hi, I’m trying to train the volumetric model on the CMU dataset, based on the train/val splits noted in issue #19. I am using 4 RTX 2080Ti GPUs.

Training itself is perfectly fine, but when evaluation reaches batch 1, the entire evaluation halts and hangs for a very long time (almost a day) before I have to stop it. The problem is reproducible: you can try running it from my forked repository here, following the CMU preprocessing instructions and running ./scripts/train_cmu.

Interestingly, if training is skipped and only evaluation is run, batch 1 takes a while (around 15 minutes) but eventually completes and the run continues. I am not sure why the problem seems to lie only with batch 1. However, when evaluation is combined with training, it hangs at batch 1 consistently and indefinitely.
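
As an aside, a hang like this can often be localised with the standard-library faulthandler module, which can periodically dump every thread's Python stack. This is only a diagnostic sketch with a toy loop standing in for the real evaluation loop:

```python
import sys
import time
import faulthandler


def run_eval(batches):
    # Toy stand-in for the evaluation loop that hangs at batch 1.
    for batch_idx, batch in enumerate(batches):
        print(f"eval batch {batch_idx}: {len(batch)} samples")
        time.sleep(0.1)  # the real code would run the model here


if __name__ == "__main__":
    # Every 60 s, dump every thread's Python stack to stderr. If the loop
    # hangs, the dump shows whether it is stuck waiting on a DataLoader
    # worker, a CUDA call, or something else.
    faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)
    run_eval([[0, 1], [2, 3], [4, 5]])
    faulthandler.cancel_dump_traceback_later()
```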

At first, I suspected a memory issue, so I reduced the batch size to 1 (for both train and val) and num_workers to 3 and 2 respectively. That still did not solve the problem. Right now, I am testing with just skipping the batch.
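
The two workarounds described above could be sketched roughly like this in PyTorch; the TensorDataset stand-in and the hard-coded batch index are placeholders for the repository's actual validation set and the batch that hangs, not its real code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Placeholder standing in for the CMU validation set.
    val_dataset = TensorDataset(torch.randn(8, 3), torch.randn(8, 1))

    # Workaround 1: shrink the batch size and the number of worker processes.
    val_loader = DataLoader(val_dataset, batch_size=1, num_workers=2, shuffle=False)

    # Workaround 2: skip the evaluation work for the batch index that hangs.
    # (If the hang happens while the loader is fetching batch 1, this only
    # helps if the fetch itself still completes.)
    for batch_idx, (inputs, targets) in enumerate(val_loader):
        if batch_idx == 1:
            continue
        print(f"evaluated batch {batch_idx}")
```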

However, skipping the batch still does not address the root of the problem:

  1. Did you guys encounter similar issues during your training?
  2. What do you guys think may be the actual issue here?

Thank you!

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
Samleo8 commented, Jun 1, 2020

Update 4: It seems to work after upgrading to the latest version of PyTorch (1.5.0).
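
For anyone reproducing the fix, a quick environment check after the upgrade might look like the sketch below; it is only a sanity check that the interpreter picks up the new build, not part of the repository:

```python
import torch

# Confirm the upgraded build is in use (the hang is reported to disappear
# on PyTorch 1.5.0) and that the GPUs are still visible.
print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("GPU count:      ", torch.cuda.device_count())
```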

0 reactions
Samleo8 commented, Jun 1, 2020

Update 2: I believe the problem is that once one of the sub-processes on one GPU finishes (so that GPU is free), it moves on to loading the eval DataLoader process instead?

Update 3: It runs fine on a single GPU, but I would really like to train on multiple GPUs; otherwise it will take too long.

Note a possibly related issue here: https://github.com/pytorch/pytorch/issues/19996
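
That PyTorch issue concerns DataLoader worker processes deadlocking. Two mitigations that commonly come up in such threads are loading data in the main process (num_workers=0) and forcing the 'spawn' start method; the sketch below shows both under the assumption that the hang happens in worker startup, with a placeholder dataset rather than the repository's code:

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset


def build_val_loader():
    # Placeholder standing in for the real CMU validation dataset.
    dataset = TensorDataset(torch.randn(8, 3), torch.randn(8, 1))
    # Mitigation 1: num_workers=0 loads batches in the main process,
    # avoiding worker-process deadlocks at the cost of loading speed.
    return DataLoader(dataset, batch_size=1, num_workers=0, shuffle=False)


if __name__ == "__main__":
    # Mitigation 2: if worker processes are kept (num_workers > 0), the
    # 'spawn' start method avoids fork-related deadlocks once CUDA has
    # already been initialised in the parent process.
    mp.set_start_method("spawn", force=True)

    for batch_idx, (inputs, targets) in enumerate(build_val_loader()):
        print(f"val batch {batch_idx} loaded, inputs shape {tuple(inputs.shape)}")
```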


Top Results From Across the Web

  • Homework 1 Bonus | Deep Learning, CMU: Since each batch contains a random subsample of the dataset, we assume that each batch is somewhat representative of the entire dataset.
  • Estimator training hangs in multiple gpu if dataset doesn't ...: Basically, if the dataset doesn't have enough elements to feed both gpus last batches the training hangs. If you doesn't have enough to...
  • Batch stuck in Waiting for Class Training | Decipher: Hi Shweta, The most likely cause of that is that you have already trained a batch on the classification model. When training the...
  • Competence-based Curriculum Learning for Neural Machine ...: Carnegie Mellon University ... specialized learning rates and large-batch training. ... can perform better if training data is presented in.
  • TensorFlow keeps consuming system memory and stuck ...: And my model will stop training after this. I have tried to change the batch size, but it still does not work. Model...
