Memory errors & Focal Loss error with lung segmentation model
Hello,
I’ve been having some trouble getting the sample lung segmentation experiment to complete successfully. Submitting the job to the STANDARD_DS3_V2 and STANDARD_DS12_V2 CPU compute targets yielded a dataloader error (shown below).
ValueError: At least one component of the runner failed: Training failed: DataLoader worker (pid(s) 275) exited unexpectedly
--num_dataload_workers is set to 8 by default, so I lowered it by passing --num_dataload_workers=0 and --train_batch_size=1. This then yielded a memory error (shown below) when I ran it on either CPU target.
ValueError: At least one component of the runner failed: Training failed: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 2774532096 bytes. Error code 12 (Cannot allocate memory)
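For a sense of scale, 2,774,532,096 bytes is about 2.6 GiB of float32 data, which a single intermediate feature map of a 3D segmentation network can plausibly reach on its own; the channel count and crop dimensions below are purely illustrative assumptions, not the Lung model's actual configuration:

```python
# Back-of-the-envelope check on the failed allocation. The channel count and
# crop dimensions are hypothetical, chosen only to show the order of magnitude.
failed_alloc_bytes = 2_774_532_096
print(f"failed allocation: {failed_alloc_bytes / 2**30:.2f} GiB "
      f"(~{failed_alloc_bytes // 4:,} float32 values)")

channels, depth, height, width = 32, 64, 512, 512   # hypothetical layer output
feature_map_bytes = channels * depth * height * width * 4  # float32
print(f"hypothetical 3D feature map: {feature_map_bytes / 2**30:.2f} GiB")
```

So even with --train_batch_size=1, the 14 GB of RAM on a DS3_v2 node doesn't leave much headroom once activations and gradients are kept around.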
Then I tried running it on the STANDARD_NC6_GPU. There, with --num_dataload_workers=0 and --train_batch_size=1, I received the error below. It looks to be raised here in InnerEye’s code; I’ve attached the driver_log for the run that produced this error as well.
ValueError: At least one component of the runner failed: Training failed: Focal loss is supported only for one-hot encoded targets
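For context on the message itself: the loss apparently expects the ground-truth segmentation as a one-hot tensor with one channel per class, rather than as a map of class indices. A minimal sketch of that encoding in plain PyTorch, with an illustrative class count and tensor shape (not the Lung model's real configuration):

```python
import torch
import torch.nn.functional as F

num_classes = 3                                          # illustrative class count
labels = torch.randint(0, num_classes, (1, 8, 64, 64))   # [batch, D, H, W] class indices
one_hot = F.one_hot(labels, num_classes)                 # [batch, D, H, W, C]
one_hot = one_hot.permute(0, 4, 1, 2, 3).float()         # [batch, C, D, H, W]
print(one_hot.shape)  # torch.Size([1, 3, 8, 64, 64])
```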
Also note that if the number of dataloader workers or the batch size is too high, even the STANDARD_NC6_GPU will produce a CUDA out-of-memory error (shown below).
Training failed: CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 11.17 GiB total capacity; 10.77 GiB already allocated; 57.81 MiB free; 10.79 GiB reserved in total by PyTorch)
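In case it helps with tuning, here is a small logging sketch I've been using to see how close training gets to the NC6's 11.17 GiB before the allocator gives up. The helper function is my own; the torch.cuda calls are standard recent PyTorch:

```python
import torch

def log_gpu_memory(tag: str = "") -> None:
    """Print allocated / reserved / total GPU memory in GiB."""
    if not torch.cuda.is_available():
        print("CUDA not available")
        return
    gib = 2 ** 30
    allocated = torch.cuda.memory_allocated() / gib
    reserved = torch.cuda.memory_reserved() / gib
    total = torch.cuda.get_device_properties(0).total_memory / gib
    print(f"[{tag}] allocated={allocated:.2f} GiB, "
          f"reserved={reserved:.2f} GiB, total={total:.2f} GiB")

# e.g. call log_gpu_memory("after forward") around the training step
```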
Is there a particular compute target that should be used to avoid these memory errors? And is there a way to get around the focal loss error?
Awesome, thank you! I was able to get back to the earlier
Training failed: Focal loss is supported only for one-hot encoded targets
error, which looks to be caused by the issue you referenced. I’ll take a stab at #339 and open a PR if I’m successful!

Closing because the remainder of the issue is covered in #339.