
Training on the CMU dataset gets stuck on batch 1


Hi, I’m trying to train the volumetric model on the CMU dataset, based on the train/val splits noted in issue #19. I am using 4 RTX 2080Ti GPUs.

Training itself is perfectly fine, but when evaluation reaches batch 1, the entire evaluation halts and hangs for a very long time (almost a day) before I have to stop it. The problem is reproducible: you can try running it from my forked repository here, following the CMU preprocessing instructions and running ./scripts/train_cmu.

Interestingly, if training is skipped and only evaluation is run, batch 1 takes a while (around 15 minutes) but eventually completes and the run continues. I am not sure why the problem seems to lie only with batch 1. However, when evaluation is combined with training, it hangs at batch 1 consistently and indefinitely.
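
As an aside, a hang like this can often be localised with the standard-library faulthandler module, which can periodically dump every thread's Python stack. This is only a diagnostic sketch with a toy loop standing in for the real evaluation loop:

```python
import sys
import time
import faulthandler


def run_eval(batches):
    # Toy stand-in for the evaluation loop that hangs at batch 1.
    for batch_idx, batch in enumerate(batches):
        print(f"eval batch {batch_idx}: {len(batch)} samples")
        time.sleep(0.1)  # the real code would run the model here


if __name__ == "__main__":
    # Every 60 s, dump every thread's Python stack to stderr. If the loop
    # hangs, the dump shows whether it is stuck waiting on a DataLoader
    # worker, a CUDA call, or something else.
    faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)
    run_eval([[0, 1], [2, 3], [4, 5]])
    faulthandler.cancel_dump_traceback_later()
```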

At first, I suspected a memory issue, so I reduced the batch size to 1 (for both train and val) and num_workers to 3 and 2 respectively. That still did not solve the problem. Right now, I am testing with just skipping the batch.
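
The two workarounds described above could be sketched roughly like this in PyTorch; the TensorDataset stand-in and the hard-coded batch index are placeholders for the repository's actual validation set and the batch that hangs, not its real code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Placeholder standing in for the CMU validation set.
    val_dataset = TensorDataset(torch.randn(8, 3), torch.randn(8, 1))

    # Workaround 1: shrink the batch size and the number of worker processes.
    val_loader = DataLoader(val_dataset, batch_size=1, num_workers=2, shuffle=False)

    # Workaround 2: skip the evaluation work for the batch index that hangs.
    # (If the hang happens while the loader is fetching batch 1, this only
    # helps if the fetch itself still completes.)
    for batch_idx, (inputs, targets) in enumerate(val_loader):
        if batch_idx == 1:
            continue
        print(f"evaluated batch {batch_idx}")
```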

However, skipping the batch still does not address the root of the problem:

  1. Did you guys encounter similar issues during your training?
  2. What do you guys think may be the actual issue here?

Thank you!

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
Samleo8 commented, Jun 1, 2020

Update 4: It seems to work after upgrading to the latest version of PyTorch (1.5.0).
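
For anyone reproducing the fix, a quick environment check after the upgrade might look like the sketch below; it is only a sanity check that the interpreter picks up the new build, not part of the repository:

```python
import torch

# Confirm the upgraded build is in use (the hang is reported to disappear
# on PyTorch 1.5.0) and that the GPUs are still visible.
print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("GPU count:      ", torch.cuda.device_count())
```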

0 reactions
Samleo8 commented, Jun 1, 2020

Update 2: I believe the problem is that once one of the sub-processes on one GPU finishes (so that GPU is free), it moves on to loading the eval DataLoader process instead?

Update 3: It runs fine on a single GPU, but I would really like to train on multiple GPUs; otherwise it will take too long.

Note a possibly related issue here: https://github.com/pytorch/pytorch/issues/19996
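
That PyTorch issue concerns DataLoader worker processes deadlocking. Two mitigations that commonly come up in such threads are loading data in the main process (num_workers=0) and forcing the 'spawn' start method; the sketch below shows both under the assumption that the hang happens in worker startup, with a placeholder dataset rather than the repository's code:

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset


def build_val_loader():
    # Placeholder standing in for the real CMU validation dataset.
    dataset = TensorDataset(torch.randn(8, 3), torch.randn(8, 1))
    # Mitigation 1: num_workers=0 loads batches in the main process,
    # avoiding worker-process deadlocks at the cost of loading speed.
    return DataLoader(dataset, batch_size=1, num_workers=0, shuffle=False)


if __name__ == "__main__":
    # Mitigation 2: if worker processes are kept (num_workers > 0), the
    # 'spawn' start method avoids fork-related deadlocks once CUDA has
    # already been initialised in the parent process.
    mp.set_start_method("spawn", force=True)

    for batch_idx, (inputs, targets) in enumerate(build_val_loader()):
        print(f"val batch {batch_idx} loaded, inputs shape {tuple(inputs.shape)}")
```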


Top Results From Across the Web

  • Homework 1 Bonus | Deep Learning, CMU: Since each batch contains a random subsample of the dataset, we assume that each batch is somewhat representative of the entire dataset.
  • Estimator training hangs in multiple gpu if dataset doesn't ...: Basically, if the dataset doesn't have enough elements to feed both gpus last batches the training hangs. If you doesn't have enough to...
  • Batch stuck in Waiting for Class Training | Decipher: Hi Shweta, The most likely cause of that is that you have already trained a batch on the classification model. When training the...
  • Competence-based Curriculum Learning for Neural Machine ...: Carnegie Mellon University ... specialized learning rates and large-batch training. ... can perform better if training data is presented in.
  • TensorFlow keeps consuming system memory and stuck ...: And my model will stop training after this. I have tried to change the batch size, but it still does not work. Model...
