
epoch durations significantly increase over time when using cached datasets?!

See original GitHub issue

Describe the bug
On two servers I observe substantially increasing epoch durations when training with cached datasets. The effect is much more pronounced on server A (64 cores, no swap) than on server B (14 cores, which now has 144 GB of swap memory).

Training starts smoothly and then slows down after a while. Unfortunately, at the moment I can only access server B to provide logs; this should change in the next few days.

I tried several recent PyTorch and MONAI versions, as well as building PyTorch from source. I also tried calling torch.cuda.empty_cache() after every epoch.

Furthermore, I tried different numbers of DataLoader workers, from 0 up to 2 × the number of CPU cores on each machine.
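For reference, these last two workarounds looked roughly like this (a minimal sketch, not the actual training script: train_ds and train_step are placeholder names for the cached dataset and the per-batch training code):

import os
import time

import torch
from monai.data import DataLoader

# try worker counts from 0 up to 2x the number of CPU cores
for num_workers in (0, os.cpu_count(), 2 * os.cpu_count()):
    loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=num_workers)
    for epoch in range(5):
        t0 = time.time()
        for batch in loader:
            train_step(batch)  # placeholder for the forward/backward pass
        torch.cuda.empty_cache()  # attempted mitigation: release cached GPU memory every epoch
        print(f"workers={num_workers}, epoch {epoch}: {time.time() - t0:.1f}s")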

On both servers, htop shows an increased share of red (kernel/system-time) bars when the slowdown sets in, indicating time spent waiting on system processes:

server A: [htop screenshot, 2021-10-12]

server B: [htop screenshot, 2021-10-12]

I also attached a plot from server B showing the increased epoch durations (the spike every few epochs comes from an extended validation epoch). One can see the epoch times go up from ~350 s at the beginning to over 2000 s. Note that this effect is far more pronounced on server A (orders of magnitude); I hope to present timings from server A, without the special validation epochs, soon. [epoch-duration plot from server B, 2021-10-12] Shortly before epoch 80 I increased the swap file from 16 GB to 144 GB, which seems to have helped a bit.

Interestingly, this effect only seems to occur for large datasets. For smaller datasets everything looks fine; one can even see slightly decreasing epoch times, which I attribute to the caching? [epoch-duration plot for a small dataset, 2021-10-12]

Notably, nobody else was using the servers during my tests.

To Reproduce
For example, run: https://github.com/Project-MONAI/tutorials/blob/master/3d_segmentation/challenge_baseline/run_net.py

Other cached datasets I tried: PersistentDataset, CacheDataset, SmartCacheDataset.
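For context, the three variants are constructed roughly like this (a minimal sketch; data and transforms stand for the usual list of dicts and Compose pipeline, and the cache_dir path and cache_num value are only illustrative, not the exact settings used):

from monai.data import CacheDataset, PersistentDataset, SmartCacheDataset

# caches all transformed items in RAM up front
cache_ds = CacheDataset(data, transform=transforms, cache_rate=1.0, num_workers=4)

# caches transformed items on disk instead of in RAM
persistent_ds = PersistentDataset(data, transform=transforms, cache_dir="./persistent_cache")

# keeps a partial in-memory cache and swaps out a fraction of it every epoch
smart_ds = SmartCacheDataset(data, transform=transforms, replace_rate=0.25, cache_num=100)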

Expected behavior
Epoch times should remain constant, as they do when training with the small dataset.

Environment
Ensuring you use the relevant python executable, please paste the output of:

python -c 'import monai; monai.config.print_debug_info()'
================================
Printing MONAI config...
================================
MONAI version: 0.6.0
Numpy version: 1.21.2
Pytorch version: 1.9.1+cu111
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 0ad9e73639e30f4f1af5a1f4a45da9cb09930179

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 3.2.1
scikit-image version: 0.18.3
Pillow version: 8.3.1
Tensorboard version: 2.6.0
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.10.1+cu111
ITK version: NOT INSTALLED or UNKNOWN VERSION.
tqdm version: 4.62.1
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: 5.8.0
pandas version: 1.3.3
einops version: NOT INSTALLED or UNKNOWN VERSION.

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies


================================
Printing system config...
================================
System: Linux
Linux version: Ubuntu 20.04.3 LTS
Platform: Linux-5.4.0-88-generic-x86_64-with-glibc2.31
Processor: x86_64
Machine: x86_64
Python version: 3.9.7
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: [popenfile(path='/home/florian/.vscode-server/bin/ee8c7def80afc00dd6e593ef12f37756d8f504ea/vscode-remote-lock.florian.ee8c7def80afc00dd6e593ef12f37756d8f504ea', fd=99, position=0, mode='w', flags=32769)]
Num physical CPUs: 14
Num logical CPUs: 28
Num usable CPUs: 28
CPU usage (%): [11.5, 60.4, 100.0, 9.8, 100.0, 24.6, 100.0, 29.0, 8.8, 100.0, 100.0, 100.0, 100.0, 100.0, 11.5, 10.5, 8.7, 9.7, 7.8, 7.8, 8.7, 8.7, 6.9, 8.7, 8.7, 7.8, 8.7, 8.7]
CPU freq. (MHz): 2874
Load avg. in last 1, 5, 15 mins (%): [30.8, 27.3, 27.1]
Disk usage (%): 52.6
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 125.5
Available memory (GB): 14.4
Used memory (GB): 44.3

================================
Printing GPU config...
================================
Num GPUs: 2
Has CUDA: True
CUDA version: 11.1
cuDNN enabled: True
cuDNN version: 8005
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
GPU 0 Name: Quadro RTX 8000
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 72
GPU 0 Total memory (GB): 47.5
GPU 0 CUDA capability (maj.min): 7.5
GPU 1 Name: Quadro RTX 8000
GPU 1 Is integrated: False
GPU 1 Is multi GPU board: False
GPU 1 Multi processor count: 72
GPU 1 Total memory (GB): 47.5
GPU 1 CUDA capability (maj.min): 7.5

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

1 reaction
rijobro commented, Oct 25, 2021

The following code:

import time

from monai.data import CacheDataset, DataLoader
from monai.transforms import Compose, LoadImaged, AddChanneld, ToTensord
# test helpers from the MONAI repository; run this script from a MONAI source checkout
from tests.utils import create_test_image_3d, make_nifti_image


def main():
    im_shape = 100
    num_ims = 500
    # write one synthetic 3D image/label pair to NIfTI files and reuse them num_ims times
    image, label = (make_nifti_image(i) for i in create_test_image_3d(im_shape, im_shape, im_shape, noise_max=100))
    data = [{"image": image, "label": label} for _ in range(num_ims)]

    transforms = Compose([
        LoadImaged(keys=["image", "label"]),
        AddChanneld(keys=["image"]),
        ToTensord(keys=["image", "label"]),
    ])

    # cache all transformed items in memory, then time each full pass over the loader
    ds = CacheDataset(data, transforms)
    dl = DataLoader(ds, num_workers=5)
    num_epochs = 20
    times = []
    for i in range(num_epochs):
        t0 = time.time()
        for _ in dl:
            pass
        times.append(time.time() - t0)
        print(f"time for epoch {i} = {times[-1]}")


if __name__ == "__main__":
    main()

Gave this as output:

time for epoch 0 = 14.29433560371399
time for epoch 1 = 14.047232866287231
time for epoch 2 = 14.039271354675293
time for epoch 3 = 14.382421493530273
time for epoch 4 = 14.286617517471313
time for epoch 5 = 14.642267942428589
time for epoch 6 = 14.51481580734253
time for epoch 7 = 14.36440372467041
time for epoch 8 = 14.417043685913086
time for epoch 9 = 14.654301881790161
time for epoch 10 = 14.619521141052246
time for epoch 11 = 14.621773719787598
time for epoch 12 = 14.666800260543823
time for epoch 13 = 14.868067979812622
time for epoch 14 = 14.978628873825073
time for epoch 15 = 14.582119941711426
time for epoch 16 = 14.34918737411499
time for epoch 17 = 14.598304033279419
time for epoch 18 = 14.281646251678467
time for epoch 19 = 14.260899543762207

Which seems sufficiently consistent to me.

Do you think you’d be able to use it as a starting point and gradually increase complexity/image sizes until you replicate your problem?
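One way to do that (a sketch reusing the imports from the script above and assuming its transforms pipeline is available at module level; the specific shape/count values are arbitrary) would be to wrap the timing loop in a function and step up the image size and dataset length:

def benchmark(im_shape, num_ims, num_epochs=5, num_workers=5):
    # synthesize one image/label pair of the requested size and repeat it num_ims times
    image, label = (make_nifti_image(i) for i in create_test_image_3d(im_shape, im_shape, im_shape, noise_max=100))
    data = [{"image": image, "label": label} for _ in range(num_ims)]
    dl = DataLoader(CacheDataset(data, transforms), num_workers=num_workers)
    times = []
    for _ in range(num_epochs):
        t0 = time.time()
        for _ in dl:
            pass
        times.append(time.time() - t0)
    return times


# gradually increase image size and dataset length until the slowdown reappears
for im_shape, num_ims in [(100, 500), (150, 1000), (200, 2000)]:
    print(im_shape, num_ims, benchmark(im_shape, num_ims))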

1 reaction
neuronflow commented, Oct 13, 2021

Thanks, I used replace_rate=0.25, will do your suggested experiment and report back.

