epoch durations significantly increase over time when using cached datasets?!
Describe the bug
On two servers I observe substantially increasing epoch durations when training with cached datasets. The effect is far more pronounced on server A (64 cores, no swap) than on server B (14 cores, now with 144 GB of swap).
The trainings start out smoothly and then become slower after a while. Unfortunately, at the moment I can only access server B to provide logs; this should change in the next few days.
I tried several recent PyTorch and MONAI versions, as well as building PyTorch from source. Furthermore, I tried calling:
torch.cuda.empty_cache()
after every epoch.
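For reference, a minimal sketch of where that call sits in the training loop (model, optimizer, loss_function, train_loader, device, and max_epochs are placeholders for the actual training setup):

```python
import torch

for epoch in range(max_epochs):
    model.train()
    for batch in train_loader:
        inputs = batch["image"].to(device)
        labels = batch["label"].to(device)
        optimizer.zero_grad()
        loss = loss_function(model(inputs), labels)
        loss.backward()
        optimizer.step()
    # release cached GPU memory back to the driver at the end of each epoch
    torch.cuda.empty_cache()
```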
I also tried different values for the number of data loader workers, from 0 up to max(cpu_cores) * 2, on each machine.
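Roughly, the worker sweep looked like this (a sketch; cached_ds stands in for whichever cached dataset was under test):

```python
import multiprocessing

from monai.data import DataLoader

max_workers = multiprocessing.cpu_count() * 2
for num_workers in (0, 4, 8, 16, max_workers):
    train_loader = DataLoader(
        cached_ds,
        batch_size=2,
        shuffle=True,
        num_workers=num_workers,
    )
    # ... run a few epochs with this loader and compare epoch durations ...
```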
On both servers, htop shows an increased share of "red" kernel time (waiting on system processes) when the servers slow down:
server A: (htop screenshot)
server B: (htop screenshot)
Also attached is a screenshot from server B showing the increased epoch durations (the spike every few epochs comes from an extended validation epoch). One can see the epoch times go up from ~350 s in the beginning to over 2000 s. Note that this effect is far more pronounced on server A (by orders of magnitude); I hope I can present timings from a training run on server A without the extra validation epochs soon. Shortly before epoch 80 I increased the swap file from 16 to 144 GB, which seems to have helped a bit.
Interestingly, this effect only seems to appear for large datasets. For smaller datasets everything is fine; there the epoch times even decrease slightly, which I attribute to the caching.
Notably, nobody else was using the servers during my tests.
To Reproduce
For example, run: https://github.com/Project-MONAI/tutorials/blob/master/3d_segmentation/challenge_baseline/run_net.py
Other cached datasets I tried: PersistentDataset, CacheDataset, SmartCacheDataset.
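For context, this is roughly how the three dataset types are constructed in MONAI 0.6; the file list and the transform chain below are placeholders, not the actual training pipeline:

```python
from monai.data import CacheDataset, PersistentDataset, SmartCacheDataset
from monai.transforms import AddChanneld, Compose, LoadImaged, ToTensord

# placeholder file list; the real one points to the challenge data
data_dicts = [{"image": f"img_{i}.nii.gz", "label": f"seg_{i}.nii.gz"} for i in range(40)]
transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    AddChanneld(keys=["image", "label"]),
    ToTensord(keys=["image", "label"]),
])

# caches the deterministic transform outputs in RAM at construction time
cache_ds = CacheDataset(data=data_dicts, transform=transforms, cache_rate=1.0, num_workers=4)

# caches intermediate transform results on disk instead of in RAM
persistent_ds = PersistentDataset(data=data_dicts, transform=transforms, cache_dir="./persistent_cache")

# keeps a partial cache in RAM and replaces a fraction of it every epoch
smart_ds = SmartCacheDataset(data=data_dicts, transform=transforms, cache_rate=0.5, replace_rate=0.25)
```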
Expected behavior
Epoch times should remain constant, as they do for training with the small dataset.
Environment
Ensuring you use the relevant python executable, please paste the output of:
python -c 'import monai; monai.config.print_debug_info()'
================================
Printing MONAI config...
================================
MONAI version: 0.6.0
Numpy version: 1.21.2
Pytorch version: 1.9.1+cu111
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 0ad9e73639e30f4f1af5a1f4a45da9cb09930179
Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 3.2.1
scikit-image version: 0.18.3
Pillow version: 8.3.1
Tensorboard version: 2.6.0
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.10.1+cu111
ITK version: NOT INSTALLED or UNKNOWN VERSION.
tqdm version: 4.62.1
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: 5.8.0
pandas version: 1.3.3
einops version: NOT INSTALLED or UNKNOWN VERSION.
For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
================================
Printing system config...
================================
System: Linux
Linux version: Ubuntu 20.04.3 LTS
Platform: Linux-5.4.0-88-generic-x86_64-with-glibc2.31
Processor: x86_64
Machine: x86_64
Python version: 3.9.7
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: [popenfile(path='/home/florian/.vscode-server/bin/ee8c7def80afc00dd6e593ef12f37756d8f504ea/vscode-remote-lock.florian.ee8c7def80afc00dd6e593ef12f37756d8f504ea', fd=99, position=0, mode='w', flags=32769)]
Num physical CPUs: 14
Num logical CPUs: 28
Num usable CPUs: 28
CPU usage (%): [11.5, 60.4, 100.0, 9.8, 100.0, 24.6, 100.0, 29.0, 8.8, 100.0, 100.0, 100.0, 100.0, 100.0, 11.5, 10.5, 8.7, 9.7, 7.8, 7.8, 8.7, 8.7, 6.9, 8.7, 8.7, 7.8, 8.7, 8.7]
CPU freq. (MHz): 2874
Load avg. in last 1, 5, 15 mins (%): [30.8, 27.3, 27.1]
Disk usage (%): 52.6
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 125.5
Available memory (GB): 14.4
Used memory (GB): 44.3
================================
Printing GPU config...
================================
Num GPUs: 2
Has CUDA: True
CUDA version: 11.1
cuDNN enabled: True
cuDNN version: 8005
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
GPU 0 Name: Quadro RTX 8000
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 72
GPU 0 Total memory (GB): 47.5
GPU 0 CUDA capability (maj.min): 7.5
GPU 1 Name: Quadro RTX 8000
GPU 1 Is integrated: False
GPU 1 Is multi GPU board: False
GPU 1 Multi processor count: 72
GPU 1 Total memory (GB): 47.5
GPU 1 CUDA capability (maj.min): 7.5
Issue Analytics
- Created: 2 years ago
- Comments: 15 (15 by maintainers)
Top GitHub Comments
The following code: (snippet not preserved)
Gave this as output: (output not preserved)
Which seems sufficiently consistent to me.
Do you think you'd be able to use it as a starting point and gradually increase complexity/image sizes until you replicate your problem?
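As a rough illustration only (not the maintainer's actual snippet), a per-epoch timing check over a cached dataset can look like this, where cached_ds is a placeholder for any of the cached datasets above:

```python
import time

from monai.data import DataLoader

loader = DataLoader(cached_ds, batch_size=2, shuffle=True, num_workers=8)

for epoch in range(20):
    start = time.time()
    for _batch in loader:
        pass  # iterate only, or run the actual training step here
    print(f"epoch {epoch}: {time.time() - start:.1f} s")
```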
Thanks, I used replace_rate=0.25, will do your suggested experiment and report back.
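For reference, a sketch of how SmartCacheDataset with replace_rate=0.25 is driven when the Ignite handlers are not used; data_dicts, transforms, and max_epochs are placeholders as above, and the manual start/update_cache/shutdown calls follow the MONAI documentation:

```python
from monai.data import DataLoader, SmartCacheDataset

smart_ds = SmartCacheDataset(
    data=data_dicts,         # placeholder data list
    transform=transforms,    # placeholder transform chain
    cache_rate=0.5,          # fraction of the dataset kept in RAM
    replace_rate=0.25,       # fraction of the cache swapped out after each epoch
)
loader = DataLoader(smart_ds, batch_size=2, num_workers=8)

smart_ds.start()             # launch the background replacement workers
for epoch in range(max_epochs):
    for batch in loader:
        ...                  # training step
    smart_ds.update_cache()  # replace part of the cached items with fresh ones
smart_ds.shutdown()
```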