Training on the CMU dataset gets stuck at batch 1
Hi, I’m trying to train the volumetric model on the CMU dataset, based on the train/val splits noted in issue #19. I am using 4 RTX 2080Ti GPUs.
Training itself runs fine, but when evaluation reaches batch 1, the entire evaluation halts and hangs for a very long time (almost a day) before I have to stop it. The problem is reproducible: you can try it from my forked repository here, following the CMU preprocessing instructions and running ./scripts/train_cmu.
Interestingly, if training is skipped and only evaluation is run, batch 1 takes a while (roughly 15 minutes) but eventually completes, and evaluation continues. I am not sure why the problem seems to lie only with batch 1. However, when evaluation is combined with training, it hangs at batch 1 consistently and indefinitely.
At first I suspected a memory issue, so I reduced the batch size to 1 (for both train and val) and num_workers to 3 and 2 respectively; a sketch of this reduced configuration is shown below. That still did not solve the problem. Right now, I am testing simply skipping the batch.
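For reference, here is a minimal sketch of the reduced-load configuration described above, assuming standard PyTorch DataLoader usage; the dataset objects and the function name are placeholders, not the repository's actual API:

```python
from torch.utils.data import DataLoader

def make_debug_loaders(train_dataset, val_dataset):
    # Deliberately small settings used to rule out memory pressure:
    # batch size 1 for both splits, num_workers 3 (train) and 2 (val).
    train_loader = DataLoader(
        train_dataset,
        batch_size=1,
        shuffle=True,
        num_workers=3,
        pin_memory=True,
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=1,
        shuffle=False,
        num_workers=2,
        pin_memory=True,
    )
    return train_loader, val_loader
```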
However, skipping the batch still does not address the root of the problem:
- Did you encounter similar issues during your training?
- What do you think the actual issue might be?
Thank you!
Top GitHub Comments
Update 4: The hang seems to go away after upgrading to the latest PyTorch (1.5.0).
Update 2: I believe the problem is that once one of the sub-processes on a GPU finishes (so that GPU becomes free), it moves on to start loading the eval DataLoader while the other GPUs are still busy?
Update 3: It runs fine on a single GPU, but I would really like to train on multiple GPUs; otherwise training will take too long.
Note: a possibly related PyTorch issue: https://github.com/pytorch/pytorch/issues/19996
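For anyone hitting the same hang, below is a minimal sketch of workarounds that are commonly suggested for DataLoader deadlocks in multi-process / multi-GPU setups. This is not the repository's code, and the OpenCV note is an assumption about a frequent cause rather than a confirmed diagnosis for this issue; upgrading PyTorch (as in Update 4) is what actually worked here.

```python
import torch
from torch.utils.data import DataLoader

def build_val_loader(val_dataset):
    # Workaround 1: num_workers=0 loads data in the main process, which
    # rules out worker-process deadlocks entirely (at the cost of speed).
    return DataLoader(
        val_dataset,
        batch_size=1,
        shuffle=False,
        num_workers=0,
        pin_memory=True,
    )

if __name__ == "__main__":
    # Workaround 2: the 'spawn' start method avoids fork-related deadlocks
    # where worker processes inherit locks held by threads in the parent.
    torch.multiprocessing.set_start_method("spawn", force=True)

    # Workaround 3 (assumption): if the dataset pipeline uses OpenCV,
    # disabling its internal threading is a frequently reported fix.
    # import cv2
    # cv2.setNumThreads(0)
```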