Cache dataset slowdown/not loading for large dataset on multi-gpu
Describe the bug
I was running a CacheDataset with ~150 images on multi-GPU with no problems. However, I recently upgraded the dataset to ~3500 images. Now, the cacher loads a few hundred images and then slows down dramatically, or the progress even regresses, while loading the images. This can happen at many different percentages of completion and may be related to multi-GPU use. I had problems running multiple single-GPU runs at the same time, but a single single-GPU run seemed to work fine. Also, setting the number of workers to 1 or to 128 does not change the problem, only the percentage at which it happens.
To Reproduce
It is hard to reproduce my exact setup, but try running a cache dataset with more than ~500 images or so. I have narrowed the problem down to the init method, which stores all the images in the cache, for example:
```python
self.train = CacheDataset(
    data=train_files,
    transform=self.train_transforms,
    cache_rate=1.0,
    num_workers=8,
)
```
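For reference, a minimal standalone sketch that exercises the same code path would be something like the following; the synthetic NIfTI volumes, the transform chain, and the dataset size are placeholders for illustration, not my actual data:

```python
import os
import tempfile

import nibabel as nib
import numpy as np
from monai.data import CacheDataset
from monai.transforms import AddChanneld, Compose, LoadImaged, ToTensord

# Write a few hundred small synthetic volumes to mimic a larger dataset.
tmp_dir = tempfile.mkdtemp()
train_files = []
for i in range(600):
    path = os.path.join(tmp_dir, f"img_{i}.nii.gz")
    nib.save(nib.Nifti1Image(np.random.rand(32, 32, 32).astype(np.float32), np.eye(4)), path)
    train_files.append({"image": path})

train_transforms = Compose(
    [
        LoadImaged(keys=["image"]),
        AddChanneld(keys=["image"]),
        ToTensord(keys=["image"]),
    ]
)

# Caching everything up front is where the slowdown appears.
train_ds = CacheDataset(
    data=train_files,
    transform=train_transforms,
    cache_rate=1.0,
    num_workers=8,
)
```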
Expected behavior
The loading should be smooth throughout, without slowing down or stopping completely.
Outputs
2021-02-15 17:56:03
1%|█ | 21/3552 [00:00<00:24, 143.54it/s]
2021-02-15 17:56:14
6%|██████ | 230/3552 [00:10<06:15, 8.84it/s]
2021-02-15 17:56:29
8%|███████ | 286/3552 [00:26<1:07:47, 1.25s/it]
2021-02-15 17:56:38
8%|███████ | 291/3552 [00:34<1:19:38, 1.47s/it]
2021-02-15 17:56:58
10%|█████████ | 348/3552 [00:55<04:18, 12.42it/s]
2021-02-15 17:57:13
11%|█████████ | 374/3552 [01:10<43:35, 1.22it/s]
2021-02-15 17:57:24
11%|█████████ | 382/3552 [01:20<1:14:13, 1.40s/it]
2021-02-15 17:57:32
11%|██████████ | 400/3552 [01:28<32:54, 1.60it/s]
2021-02-15 17:58:05
12%|██████████ | 424/3552 [02:01<2:21:03, 2.71s/it]
2021-02-15 17:58:41
12%|██████████ | 427/3552 [02:37<4:47:51, 5.53s/it]
2021-02-15 17:59:30
12%|███████████ | 430/3552 [03:27<7:41:56, 8.88s/it]
2021-02-15 17:59:31
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2021-02-15 17:59:42
12%|███████████ | 440/3552 [03:39<3:16:16, 3.78s/it]
2021-02-15 18:00:00
13%|███████████ | 446/3552 [03:56<3:13:10, 3.73s/it]
2021-02-15 18:02:58
13%|███████████ | 447/3552 [06:54<26:35:29, 30.83s/it]
2021-02-15 18:03:14
13%|███████████ | 453/3552 [07:10<11:10:22, 12.98s/it]
Looking at the timestamps, you can see that the loading speed drops dramatically and essentially falls to zero after about 450 images.
Environment
================================ Printing MONAI config…
MONAI version: 0.4.0
Numpy version: 1.19.2
Pytorch version: 1.7.1
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 0563a4467fa602feca92d91c7f47261868d171a1
Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 3.2.1
scikit-image version: NOT INSTALLED or UNKNOWN VERSION.
Pillow version: 8.1.0
Tensorboard version: 2.3.0
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.8.2
ITK version: NOT INSTALLED or UNKNOWN VERSION.
tqdm version: 4.56.0
lmdb version: 1.1.1
psutil version: 5.8.0
For details about installing the optional dependencies, please visit: https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
================================ Printing system config…
System: Linux
Linux version: Ubuntu 20.04.1 LTS
Platform: Linux-5.8.0-40-generic-x86_64-with-debian-bullseye-sid
Processor: x86_64
Machine: x86_64
Python version: 3.7.9
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 64
Num logical CPUs: 128
Num usable CPUs: 128
CPU usage (%): [20.0, 0.0, 0.0, 10.0, 0.0, 11.1, 33.3, 20.0, 20.0, 0.0, 10.0, 0.0, 0.0, 10.0, 100.0, 100.0, 0.0, 11.1, 11.1, 0.0, 11.1, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 11.1, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0, 11.1, 0.0, 11.1, 0.0, 11.1, 11.1, 20.0, 0.0, 11.1, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 11.1, 11.1, 11.1, 0.0, 0.0, 0.0, 0.0, 11.1, 0.0, 0.0, 0.0, 0.0, 0.0, 22.2, 11.1, 0.0, 0.0, 11.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 20.0, 0.0, 0.0, 11.1, 0.0, 30.0, 33.3, 0.0, 10.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 0.0, 0.0, 11.1, 11.1, 0.0, 10.0, 11.1, 11.1, 11.1, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0]
CPU freq. (MHz): 2360
Load avg. in last 1, 5, 15 mins (%): [24.8, 20.4, 20.4]
Disk usage (%): 18.1
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 251.6
Available memory (GB): 153.4
Used memory (GB): 91.8
================================ Printing GPU config…
Num GPUs: 4
Has CUDA: True
CUDA version: 11.0
cuDNN enabled: True
cuDNN version: 8005
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'compute_37']
Info for GPU: 3
Name: Quadro RTX 8000
Is integrated: False
Is multi GPU board: False
Multi processor count: 72
Total memory (GB): 47.5
Cached memory (GB): 0.0
Allocated memory (GB): 0.0
CUDA capability (maj.min): 7.5
Top GitHub Comments
It appears that the problem has gone away (?) after a system reboot, but I added the dataset partitioning for good measure. Thanks @Nic-Ma. On a related note, it seems that the cache loading is sometimes very fast but other times very slow. Any idea why that might be?
I was also having an issue with GPUs deadlocking/hanging at 100% usage at seemingly random times partway through training. Maybe the partitioning will also help there. Anyway, as the problem has not reappeared, I will close the issue. Hopefully it does not come back.
Hi @ndalton12,
For distributed data parallel, to avoid duplicated caching on every rank, we usually partition the dataset before caching. You can check this example for more details: https://github.com/Project-MONAI/tutorials/blob/master/acceleration/distributed_training/unet_training_smartcache.py#L120
Note that this avoids duplicated caching in memory, but it does not perform a global shuffle every epoch; every rank only shuffles its own partition. Thanks.
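For anyone hitting the same issue, here is a sketch of the per-rank partitioning pattern described above, based on the linked tutorial; the torch.distributed initialization and the variable names (train_files, train_transforms) are assumed to exist in your own script:

```python
# Sketch of per-rank partitioning before caching. Assumes torch.distributed
# has already been initialized (e.g. via dist.init_process_group).
import torch.distributed as dist
from monai.data import CacheDataset, partition_dataset

# Split the full file list so that each rank caches only its own shard,
# avoiding duplicated caching across ranks.
data_part = partition_dataset(
    data=train_files,                      # full list of data dicts (assumed defined)
    num_partitions=dist.get_world_size(),  # one partition per rank
    shuffle=True,
    seed=0,
    even_divisible=True,
)[dist.get_rank()]

train_ds = CacheDataset(
    data=data_part,
    transform=train_transforms,            # assumed defined elsewhere
    cache_rate=1.0,
    num_workers=8,
)
```

As noted above, each rank then shuffles only its own partition rather than the whole dataset.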