Memory errors & Focal Loss error with lung segmentation model
Hello,
I’ve been having some trouble getting the sample lung segmentation experiment to complete successfully. Submitting the job to the STANDARD_DS3_V2 and STANDARD_DS12_V2 CPU compute targets yielded a dataloader error (shown below).
ValueError: At least one component of the runner failed: Training failed: DataLoader worker (pid(s) 275) exited unexpectedly
--num_dataload_workers is set to 8 by default, so I lowered it by passing --num_dataload_workers=0 and --train_batch_size=1. This then yielded a memory error (shown below) when I ran it on either CPU target.
ValueError: At least one component of the runner failed: Training failed: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 2774532096 bytes. Error code 12 (Cannot allocate memory)
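For a sense of scale, 2,774,532,096 bytes is about 2.6 GiB of float32 data, which a single intermediate feature map of a 3D segmentation network can plausibly reach on its own; the channel count and crop dimensions below are purely illustrative assumptions, not the Lung model's actual configuration:

```python
# Back-of-the-envelope check on the failed allocation. The channel count and
# crop dimensions are hypothetical, chosen only to show the order of magnitude.
failed_alloc_bytes = 2_774_532_096
print(f"failed allocation: {failed_alloc_bytes / 2**30:.2f} GiB "
      f"(~{failed_alloc_bytes // 4:,} float32 values)")

channels, depth, height, width = 32, 64, 512, 512   # hypothetical layer output
feature_map_bytes = channels * depth * height * width * 4  # float32
print(f"hypothetical 3D feature map: {feature_map_bytes / 2**30:.2f} GiB")
```

So even with --train_batch_size=1, the 14 GB of RAM on a DS3_v2 node doesn't leave much headroom once activations and gradients are kept around.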
Then I tried running it on the STANDARD_NC6_GPU. There, with --num_dataload_workers=0 and --train_batch_size=1, I received the error below. It looks to be raised here in InnerEye’s code; I’ve attached the driver_log for the run that produced this error as well.
ValueError: At least one component of the runner failed: Training failed: Focal loss is supported only for one-hot encoded targets
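For context on the message itself: the loss apparently expects the ground-truth segmentation as a one-hot tensor with one channel per class, rather than as a map of class indices. A minimal sketch of that encoding in plain PyTorch, with an illustrative class count and tensor shape (not the Lung model's real configuration):

```python
import torch
import torch.nn.functional as F

num_classes = 3                                          # illustrative class count
labels = torch.randint(0, num_classes, (1, 8, 64, 64))   # [batch, D, H, W] class indices
one_hot = F.one_hot(labels, num_classes)                 # [batch, D, H, W, C]
one_hot = one_hot.permute(0, 4, 1, 2, 3).float()         # [batch, C, D, H, W]
print(one_hot.shape)  # torch.Size([1, 3, 8, 64, 64])
```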
Also note that if the number of dataloader workers or the batch size is too high, even the STANDARD_NC6_GPU will produce a CUDA out-of-memory error (shown below).
Training failed: CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 11.17 GiB total capacity; 10.77 GiB already allocated; 57.81 MiB free; 10.79 GiB reserved in total by PyTorch)
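In case it helps with tuning, here is a small logging sketch I've been using to see how close training gets to the NC6's 11.17 GiB before the allocator gives up. The helper function is my own; the torch.cuda calls are standard recent PyTorch:

```python
import torch

def log_gpu_memory(tag: str = "") -> None:
    """Print allocated / reserved / total GPU memory in GiB."""
    if not torch.cuda.is_available():
        print("CUDA not available")
        return
    gib = 2 ** 30
    allocated = torch.cuda.memory_allocated() / gib
    reserved = torch.cuda.memory_reserved() / gib
    total = torch.cuda.get_device_properties(0).total_memory / gib
    print(f"[{tag}] allocated={allocated:.2f} GiB, "
          f"reserved={reserved:.2f} GiB, total={total:.2f} GiB")

# e.g. call log_gpu_memory("after forward") around the training step
```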
Is there a particular compute target that should be used to avoid these memory errors? And is there a way to get around the focal loss error?
Awesome, thank you! I was able to get back to the earlier
Training failed: Focal loss is supported only for one-hot encoded targets
error, which looks to be caused by the issue you referenced. I’ll take a stab at #339 and open a PR if I’m successful!

Closing because the remainder of the issue is covered in #339.