Scalability of monailabel (OOM errors)
Describe the bug
I have encountered two different situations where MONAI Label uses far more memory than I would expect. Are these user errors, or are they related to my dataset? Has MONAI Label been designed with scalability in mind?
- When I press Train, my entire dataset is loaded into CPU RAM. Our dataset is larger than some of the competition datasets (BTCV or MSD) but not extremely large: roughly 100 CT scans of 512x512xH voxels, where H is usually around 500. Uncompressed, that adds up to nearly 100 GB, which causes the program to crash. Is there an option to avoid loading all data into RAM and instead load it on demand, perhaps with pre-fetching to avoid creating a bottleneck? Since I am using the segmentation model, which trains on patches, perhaps it would be sufficient to load just the patches into RAM rather than the full images? (See the first sketch after this list.)
In case it is relevant, my dataset has 12 foreground labels.
My workaround is to use swap, but obviously that is not ideal.
- After training, clicking Run gives me another OOM error. I tried decreasing the `roi_size` for my model, but even at 64x64x64 I'm still exceeding the 8 GB of GPU VRAM available:
For 128x128x128:
```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.25 GiB (GPU 0; 7.92 GiB total capacity; 440.46 MiB already allocated; 6.63 GiB free; 610.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
For 96x96x96:
```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.62 GiB (GPU 0; 7.92 GiB total capacity; 1.20 GiB already allocated; 5.46 GiB free; 1.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
For 64x64x64:
```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.62 GiB (GPU 0; 7.92 GiB total capacity; 1.20 GiB already allocated; 5.46 GiB free; 1.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
It might be expected behaviour for `deepedit` models to cause an OOM, since they run on the full image. However, I expected a `segmentation` model to scale to arbitrarily sized images, because it analyses the image in patches. Have I misunderstood something, or is the stitching of the patches also carried out on the GPU? (See the second sketch after this list.)
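For reference, here is a minimal sketch of the difference between caching the whole training set in RAM and loading volumes on demand, using plain MONAI APIs. The file paths and the transform chain are placeholders for illustration, not the pipeline MONAI Label actually builds:

```python
# Hedged sketch: lazy loading vs. whole-dataset caching with plain MONAI.
# Paths and transforms are placeholders, not MONAI Label's real pipeline.
import glob
from monai.data import Dataset, CacheDataset, DataLoader
from monai.transforms import Compose, LoadImaged, EnsureChannelFirstd, RandCropByPosNegLabeld

data_dicts = [
    {"image": img, "label": img.replace("imagesTr", "labelsTr")}
    for img in sorted(glob.glob("dataset/imagesTr/*.nii.gz"))
]

transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    # Crop patches so only 96^3 sub-volumes reach the network, not full scans.
    RandCropByPosNegLabeld(
        keys=["image", "label"], label_key="label",
        spatial_size=(96, 96, 96), pos=1, neg=1, num_samples=4,
    ),
])

# CacheDataset keeps every pre-processed volume in CPU RAM (fast, memory hungry):
# train_ds = CacheDataset(data=data_dicts, transform=transforms, cache_rate=1.0)

# A plain Dataset loads and transforms each volume on demand inside the
# DataLoader workers, so only the items currently in flight occupy RAM:
train_ds = Dataset(data=data_dicts, transform=transforms)
train_loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=4)
```

And a minimal sketch of sliding-window inference that evaluates patches on the GPU but stitches the full-resolution output on the CPU; the model and input names are placeholders:

```python
# Hedged sketch: patches run on the GPU, stitching happens on the CPU.
import torch
from monai.inferers import SlidingWindowInferer

inferer = SlidingWindowInferer(
    roi_size=(96, 96, 96),
    sw_batch_size=1,
    overlap=0.25,
    sw_device=torch.device("cuda"),  # device each patch is evaluated on
    device=torch.device("cpu"),      # device the full output volume is stitched on
)

# model: a trained network on the GPU; image: a (1, C, H, W, D) tensor.
# with torch.no_grad():
#     prediction = inferer(image, model)
```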
To Reproduce
Steps to reproduce the behavior:
1. Get hold of a medium-sized dataset with ground-truth labels and put it in the folder structure expected by MONAI Label. Hold back the ground-truth labels for at least one image, for use in the final step.
2. Make a copy of the `radiology/lib/config/segmentation.py` file (e.g. `segmentation_custom.py`) and modify the foreground classes and `roi_size` (see the sketch after this list).
3. Run the MONAI Label app:
   `monailabel start_server --app radiology --studies relative/path/to/images --conf models segmentation_custom --conf use_pretrained_model false`
4. In Slicer, connect to the server and click Train.
5. If you have enough CPU RAM and training completes, click Next Sample to get an unlabelled image and then Run to automatically generate labels.
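For illustration, the kind of edit made in the copied config might look like the sketch below. The class and attribute names only mirror the general pattern of the radiology app's segmentation config; the real file subclasses MONAI Label's `TaskConfig` and may differ between versions:

```python
# Hedged sketch of the customisation done in segmentation_custom.py.
# Stand-alone class for illustration only; the real config subclasses TaskConfig.
class SegmentationCustomConfig:
    def __init__(self):
        # 12 foreground classes for this dataset (names are placeholders).
        self.labels = {f"organ_{i:02d}": i for i in range(1, 13)}
        # Patch size used for training and sliding-window inference.
        self.roi_size = (64, 64, 64)
```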
Expected behavior
I expected to be able to train a network and run inference on a dataset with an arbitrary number of arbitrarily sized images.
I've used 128x128x128 patches with nnU-Net and been able to run inference on GPUs with only 4 GB of VRAM, so I'm surprised that an 8 GB GPU gets an OOM when trying to run the segmentation network with 64x64x64 patches.
8GB of GPU memory was enough to train the network, so I assumed it would also be enough to run inference.
Screenshots
N/A
Environment
Ensuring you use the relevant python executable, please paste the output of:
python -c 'import monai; monai.config.print_debug_info()'
================================
Printing MONAI config...
================================
MONAI version: 1.0.1
Numpy version: 1.23.4
Pytorch version: 1.13.0+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: 8271a193229fe4437026185e218d5b06f7c8ce69
MONAI __file__: /home/chris/Software/monai/venv/lib/python3.8/site-packages/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: 0.4.10
Nibabel version: 4.0.2
scikit-image version: 0.19.3
Pillow version: 9.3.0
Tensorboard version: 2.11.0
gdown version: 4.5.3
TorchVision version: 0.14.0+cu117
tqdm version: 4.64.1
lmdb version: 1.3.0
psutil version: 5.9.4
pandas version: NOT INSTALLED or UNKNOWN VERSION.
einops version: 0.6.0
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
pynrrd version: 0.4.3
For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
================================
Printing system config...
================================
System: Linux
Linux version: Ubuntu 20.04.5 LTS
Platform: Linux-5.14.0-1054-oem-x86_64-with-glibc2.29
Processor: x86_64
Machine: x86_64
Python version: 3.8.10
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 6
Num logical CPUs: 12
Num usable CPUs: 12
CPU usage (%): [16.5, 22.2, 15.4, 25.0, 20.9, 82.1, 13.4, 10.5, 12.3, 12.3, 13.9, 15.2]
CPU freq. (MHz): 1579
Load avg. in last 1, 5, 15 mins (%): [11.6, 10.4, 26.0]
Disk usage (%): 81.0
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 31.0
Available memory (GB): 28.3
Used memory (GB): 2.2
================================
Printing GPU config...
================================
Num GPUs: 1
Has CUDA: True
CUDA version: 11.7
cuDNN enabled: True
cuDNN version: 8500
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
GPU 0 Name: NVIDIA GeForce GTX 1080
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 20
GPU 0 Total memory (GB): 7.9
GPU 0 CUDA capability (maj.min): 6.1
Additional context
N/A
Top GitHub Comments
Ah right, I probably should have thought to look more closely at the error messages. One line above the OOM message was this line:

I managed to follow that back to the `inferer()` method in `radiology/lib/infers/segmentation.py` and added an argument to the call: `SlidingWindowInferer(roi_size=self.roi_size, device=torch.device('cpu'))`.

Unfortunately, that only postponed the OOM error until the post transforms. Again following your advice, I read through the stack trace and discovered that it was the `EnsureType` conversion which was trying to load the full image back into GPU memory. I was able to modify that line as well, and now it runs 😄

Thanks for your help!! This lets me run inference, and with my swap workaround I can run training too. Is it worth turning this into a feature request for more defensive programming? Perhaps use `torch.device('cpu')` for these operations unless the user explicitly enables the GPU, or unless the image size is guaranteed to fit in GPU memory? For nnU-Net there is a command line argument, `--all-in-gpu`, that serves this purpose; having it disabled by default removes one potential source of problems. (The two changes are sketched below.)
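A rough sketch of what those two edits look like, assuming the method layout of `radiology/lib/infers/segmentation.py` (exact signatures and transform keys may differ between MONAI Label versions):

```python
# Hedged sketch of the two changes described above; not the verbatim file.
import torch
from monai.inferers import SlidingWindowInferer
from monai.transforms import EnsureTyped

# Method of the segmentation infer task (shown standalone here for brevity).
def inferer(self, data=None):
    # Evaluate patches as before, but accumulate/stitch the full-size output
    # volume on the CPU to avoid the GPU OOM.
    return SlidingWindowInferer(roi_size=self.roi_size, device=torch.device("cpu"))

# In the post transforms, keep the full-size prediction on the CPU as well,
# e.g. by giving the type-conversion transform an explicit device:
post_transform_example = EnsureTyped(keys="pred", device=torch.device("cpu"))
```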
You can calculate the dump size of `PersistentDataset`… you might notice a difference only if the dump size is reasonable. The persistent cache is saved into your model/xyz/train_xy folder (cache or .cache). You can also dig into the details to see how much is loaded into GPU vs CPU: if your pre-transforms are cached after loading data onto the GPU, the corresponding tensors get saved to disk that way. Some memory profilers can help you learn a bit more.
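As a small illustration (the cache path below is a placeholder; point it at whichever `cache` or `.cache` folder your app's train task creates):

```python
# Hedged sketch: sum the on-disk size of the persistent cache directory.
from pathlib import Path

cache_dir = Path("radiology/model/segmentation_custom/train_01/cache")  # placeholder path
total_bytes = sum(f.stat().st_size for f in cache_dir.rglob("*") if f.is_file())
print(f"Persistent cache size: {total_bytes / 1e9:.2f} GB")
```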
I understand the point about supporting GPU vs non-GPU enforcement in some of the examples; it could be a good config option. And for your segmentation_xxx model, you can do something like this…
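The snippet that followed in the original comment is not reproduced here. As a rough illustration of the idea only, a config-driven device toggle could look something like the following; the conf key name `sw_cpu_stitching` and the assumption that the infer task can read `self.conf` are both invented for this sketch:

```python
# Hypothetical illustration: let a server --conf flag decide whether the
# sliding-window output is stitched on the CPU or the GPU.
import torch
from monai.inferers import SlidingWindowInferer

def inferer(self, data=None):
    # e.g. start the server with: --conf sw_cpu_stitching true
    use_cpu = str(self.conf.get("sw_cpu_stitching", "true")).lower() == "true"
    device = torch.device("cpu") if use_cpu else None  # None keeps the input's device
    return SlidingWindowInferer(roi_size=self.roi_size, device=device)
```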