Cache dataset slowdown/not loading for large dataset on multi-gpu
Describe the bug
I was running a CacheDataset with ~150 images on multi-GPU with no problems. However, I recently upgraded the dataset to ~3500 images. Now, the cacher loads a few hundred images and then slows down dramatically, or the progress even regresses, while loading the images. This can happen at many different percentages of completion and may be related to multi-GPU use. I had problems running multiple single-GPU runs at the same time, but a single single-GPU run seemed to work fine. Also, setting the number of workers to 1 or to 128 does not change the problem, only the percentage at which it happens.
To Reproduce
It is hard to reproduce my exact setup, but try running a cache dataset with more than ~500 images or so. I have narrowed the problem down to the init method, which stores all the images in the cache, for example:
```python
self.train = CacheDataset(
    data=train_files,
    transform=self.train_transforms,
    cache_rate=1.0,
    num_workers=8,
)
```
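For reference, a minimal standalone sketch that exercises the same code path would be something like the following; the synthetic NIfTI volumes, the transform chain, and the dataset size are placeholders for illustration, not my actual data:

```python
import os
import tempfile

import nibabel as nib
import numpy as np
from monai.data import CacheDataset
from monai.transforms import AddChanneld, Compose, LoadImaged, ToTensord

# Write a few hundred small synthetic volumes to mimic a larger dataset.
tmp_dir = tempfile.mkdtemp()
train_files = []
for i in range(600):
    path = os.path.join(tmp_dir, f"img_{i}.nii.gz")
    nib.save(nib.Nifti1Image(np.random.rand(32, 32, 32).astype(np.float32), np.eye(4)), path)
    train_files.append({"image": path})

train_transforms = Compose(
    [
        LoadImaged(keys=["image"]),
        AddChanneld(keys=["image"]),
        ToTensord(keys=["image"]),
    ]
)

# Caching everything up front is where the slowdown appears.
train_ds = CacheDataset(
    data=train_files,
    transform=train_transforms,
    cache_rate=1.0,
    num_workers=8,
)
```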
Expected behavior
The loading should be smooth throughout, without slowing down or stopping completely.
Outputs
2021-02-15 17:56:03
1%|█ | 21/3552 [00:00<00:24, 143.54it/s]
2021-02-15 17:56:14
6%|██████ | 230/3552 [00:10<06:15, 8.84it/s]
2021-02-15 17:56:29
8%|███████ | 286/3552 [00:26<1:07:47, 1.25s/it]
2021-02-15 17:56:38
8%|███████ | 291/3552 [00:34<1:19:38, 1.47s/it]
2021-02-15 17:56:58
10%|█████████ | 348/3552 [00:55<04:18, 12.42it/s]
2021-02-15 17:57:13
11%|█████████ | 374/3552 [01:10<43:35, 1.22it/s]
2021-02-15 17:57:24
11%|█████████ | 382/3552 [01:20<1:14:13, 1.40s/it]
2021-02-15 17:57:32
11%|██████████ | 400/3552 [01:28<32:54, 1.60it/s]
2021-02-15 17:58:05
12%|██████████ | 424/3552 [02:01<2:21:03, 2.71s/it]
2021-02-15 17:58:41
12%|██████████ | 427/3552 [02:37<4:47:51, 5.53s/it]
2021-02-15 17:59:30
12%|███████████ | 430/3552 [03:27<7:41:56, 8.88s/it]
2021-02-15 17:59:31
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2021-02-15 17:59:42
12%|███████████ | 440/3552 [03:39<3:16:16, 3.78s/it]
2021-02-15 18:00:00
13%|███████████ | 446/3552 [03:56<3:13:10, 3.73s/it]
2021-02-15 18:02:58
13%|███████████ | 447/3552 [06:54<26:35:29, 30.83s/it]
2021-02-15 18:03:14
13%|███████████ | 453/3552 [07:10<11:10:22, 12.98s/it]
Looking at the timestamps, you can see that the loading speed drops dramatically and essentially falls to zero after about 450 images.
Environment
================================ Printing MONAI config…
MONAI version: 0.4.0
Numpy version: 1.19.2
Pytorch version: 1.7.1
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 0563a4467fa602feca92d91c7f47261868d171a1
Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 3.2.1
scikit-image version: NOT INSTALLED or UNKNOWN VERSION.
Pillow version: 8.1.0
Tensorboard version: 2.3.0
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.8.2
ITK version: NOT INSTALLED or UNKNOWN VERSION.
tqdm version: 4.56.0
lmdb version: 1.1.1
psutil version: 5.8.0
For details about installing the optional dependencies, please visit: https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
================================ Printing system config…
System: Linux
Linux version: Ubuntu 20.04.1 LTS
Platform: Linux-5.8.0-40-generic-x86_64-with-debian-bullseye-sid
Processor: x86_64
Machine: x86_64
Python version: 3.7.9
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 64
Num logical CPUs: 128
Num usable CPUs: 128
CPU usage (%): [20.0, 0.0, 0.0, 10.0, 0.0, 11.1, 33.3, 20.0, 20.0, 0.0, 10.0, 0.0, 0.0, 10.0, 100.0, 100.0, 0.0, 11.1, 11.1, 0.0, 11.1, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 11.1, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0, 11.1, 0.0, 11.1, 0.0, 11.1, 11.1, 20.0, 0.0, 11.1, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 11.1, 11.1, 11.1, 0.0, 0.0, 0.0, 0.0, 11.1, 0.0, 0.0, 0.0, 0.0, 0.0, 22.2, 11.1, 0.0, 0.0, 11.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 20.0, 0.0, 0.0, 11.1, 0.0, 30.0, 33.3, 0.0, 10.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 0.0, 0.0, 11.1, 11.1, 0.0, 10.0, 11.1, 11.1, 11.1, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0]
CPU freq. (MHz): 2360
Load avg. in last 1, 5, 15 mins (%): [24.8, 20.4, 20.4]
Disk usage (%): 18.1
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 251.6
Available memory (GB): 153.4
Used memory (GB): 91.8
================================ Printing GPU config…
Num GPUs: 4
Has CUDA: True
CUDA version: 11.0
cuDNN enabled: True
cuDNN version: 8005
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'compute_37']
Info for GPU: 3
Name: Quadro RTX 8000
Is integrated: False
Is multi GPU board: False
Multi processor count: 72
Total memory (GB): 47.5
Cached memory (GB): 0.0
Allocated memory (GB): 0.0
CUDA capability (maj.min): 7.5
Top GitHub Comments
It appears that the problem has gone away (?) after a system reboot, but I added the dataset partitioning for good measure. Thanks @Nic-Ma. On a related note, it seems that the cache loading is sometimes very fast but other times very slow. Any idea why that might be?
I was also having an issue with GPUs deadlocking/hanging at 100% usage at seemingly random times partway through training. Maybe the partitioning will also help there. Anyway, as the problem has not reappeared, I will close the issue. Hopefully it does not come back.
Hi @ndalton12,
For distributed data parallel, to avoid duplicated caching on every rank, we usually partition the dataset before caching. You can check this example for more details: https://github.com/Project-MONAI/tutorials/blob/master/acceleration/distributed_training/unet_training_smartcache.py#L120
Note that this avoids duplicated caching in memory, but it does not perform a global shuffle every epoch; every rank only shuffles its own partition. Thanks.
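For anyone hitting the same issue, here is a sketch of the per-rank partitioning pattern described above, based on the linked tutorial; the torch.distributed initialization and the variable names (train_files, train_transforms) are assumed to exist in your own script:

```python
# Sketch of per-rank partitioning before caching. Assumes torch.distributed
# has already been initialized (e.g. via dist.init_process_group).
import torch.distributed as dist
from monai.data import CacheDataset, partition_dataset

# Split the full file list so that each rank caches only its own shard,
# avoiding duplicated caching across ranks.
data_part = partition_dataset(
    data=train_files,                      # full list of data dicts (assumed defined)
    num_partitions=dist.get_world_size(),  # one partition per rank
    shuffle=True,
    seed=0,
    even_divisible=True,
)[dist.get_rank()]

train_ds = CacheDataset(
    data=data_part,
    transform=train_transforms,            # assumed defined elsewhere
    cache_rate=1.0,
    num_workers=8,
)
```

As noted above, each rank then shuffles only its own partition rather than the whole dataset.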