Memory errors & Focal Loss error with lung segmentation model

Hello,

I’ve been having some trouble getting the sample lung segmentation experiment to complete successfully. Submitting the task to STANDARD_DS3_V2 and STANDARD_DS12_V2 CPUs yielded a dataloader error (shown below).

ValueError: At least one component of the runner failed: Training failed: DataLoader worker (pid(s) 275) exited unexpectedly

--num_dataload_workers is set to 8 by default, so I lowered it by passing --num_dataload_workers=0, together with --train_batch_size=1. This then yielded a memory error (shown below) when I ran it on either CPU.

ValueError: At least one component of the runner failed: Training failed: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 2774532096 bytes. Error code 12 (Cannot allocate memory)
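
For scale, the allocation that failed in the message above is a single request of about 2.6 GiB. A minimal sketch of the conversion follows. The helper name `bytes_to_gib` is hypothetical, and the element count assumes float32 data (4 bytes per element), which the log does not confirm.

```python
def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


# Byte count copied from the DefaultCPUAllocator error message.
requested_bytes = 2774532096

print(f"Requested: {bytes_to_gib(requested_bytes):.2f} GiB")

# Assumption: float32 elements at 4 bytes each. The real dtype is not in the log.
print(f"Elements if float32: {requested_bytes // 4}")
```

Comparing this figure with the RAM actually available on the compute target shows whether one such buffer, plus the rest of the training state, can plausibly fit.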

Then, I tried running it on the STANDARD_NC6_GPU. Here, with the parameters --num_dataload_workers=0 and --train_batch_size=1, I received the error below, which appears to be raised in InnerEye’s own code. I’ve attached the driver_log for the run that produced this error as well.

ValueError: At least one component of the runner failed: Training failed: Focal loss is supported only for one-hot encoded targets
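
For readers unfamiliar with the term, a one-hot encoded segmentation target marks each voxel as belonging to exactly one class. The sketch below illustrates that property in plain Python. It is not InnerEye’s actual check: the function name `is_one_hot` and the flattened per-class mask layout are assumptions made for illustration.

```python
from typing import Sequence


def is_one_hot(class_masks: Sequence[Sequence[int]]) -> bool:
    """Return True if the per-class masks form a one-hot target.

    class_masks[c][i] is the label (0 or 1) of voxel i for class c.
    Hypothetical helper for illustration only.
    """
    for voxel_labels in zip(*class_masks):
        # Every entry must be binary.
        if any(value not in (0, 1) for value in voxel_labels):
            return False
        # Exactly one class may be active per voxel.
        if sum(voxel_labels) != 1:
            return False
    return True


# Two classes over four voxels: each voxel belongs to exactly one class.
exclusive = [[1, 0, 0, 1],
             [0, 1, 1, 0]]

# The first voxel is labelled as both classes, so the masks overlap.
overlapping = [[1, 0, 0, 1],
               [1, 1, 1, 0]]

print(is_one_hot(exclusive))
print(is_one_hot(overlapping))
```

Masks that overlap, or voxels with no class at all, fail this test. Running a check like this on the ground-truth labels is one way to see whether the dataset itself would trigger the error.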

Also note that if there are too many workers or the batch size is too high, even the STANDARD_NC6_GPU will produce a CUDA memory error (shown below).

Training failed: CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 11.17 GiB total capacity; 10.77 GiB already allocated; 57.81 MiB free; 10.79 GiB reserved in total by PyTorch)

Is there a particular compute target that should be used to avoid these memory errors? And is there a way to get around the focal loss error?

AB#3881

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:7 (4 by maintainers)

Top GitHub Comments

shreyasingh1 commented, Mar 1, 2021 (1 reaction)

Awesome, thank you! I was able to get back to the earlier "Training failed: Focal loss is supported only for one-hot encoded targets" error, which looks to be caused by the issue you referenced. I’ll take a stab at #339 and open a PR if I’m successful!

ant0nsc commented, Apr 6, 2021 (0 reactions)

Closing because the remainder of the issue is covered in #339.
