
Cache dataset slowdown/not loading for large dataset on multi-gpu


Describe the bug
I was running a CacheDataset with around 150 images on multi-GPU with no problems. However, I recently upgraded the dataset to around 3500 images. Now the cacher loads a few hundred images and then slows down dramatically, or even regresses, while loading the rest. This can happen at many different percentages and may be related to multi-GPU: I had problems running multiple single-GPU runs at the same time, but a single single-GPU run seemed to work OK. Also, setting the number of workers to 1 or to 128 does not change the problem, only the percentage at which it happens.

To Reproduce
Hard to reproduce exactly as I do, but try running a CacheDataset with more than ~500 images or so. I have narrowed the problem down to the init method, which stores all the images in the cache.

            self.train = CacheDataset(
                data=train_files,
                transform=self.train_transforms,
                cache_rate=1.0,
                num_workers=8,
            )

for example.
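
For reference, below is a minimal, self-contained sketch of this kind of setup. It is not the exact pipeline from the issue: the data/*.nii.gz pattern and the single LoadImaged transform are placeholders, and it assumes a MONAI version where LoadImaged is available. With cache_rate=1.0, CacheDataset applies the deterministic transforms to every item and keeps the results in memory while it is being constructed, which is exactly the step that stalls here.

    # Minimal sketch, not the reporter's exact code: cache a large file list up front.
    import glob

    from monai.data import CacheDataset
    from monai.transforms import Compose, LoadImaged

    # Hypothetical file list; the issue involves roughly 3500 images.
    train_files = [{"image": path} for path in sorted(glob.glob("data/*.nii.gz"))]

    # Placeholder transform chain; the real chain is application specific.
    train_transforms = Compose([LoadImaged(keys=["image"])])

    train_ds = CacheDataset(
        data=train_files,
        transform=train_transforms,
        cache_rate=1.0,   # cache every item in memory during construction
        num_workers=8,    # worker processes used only to build the cache
    )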

Expected behavior
The loading should be smooth throughout, not slow down or stop completely.

Outputs

2021-02-15 17:56:03
  1% |   21/3552 [00:00<00:24, 143.54it/s]
2021-02-15 17:56:14
  6% |  230/3552 [00:10<06:15, 8.84it/s]
2021-02-15 17:56:29
  8% |  286/3552 [00:26<1:07:47, 1.25s/it]
2021-02-15 17:56:38
  8% |  291/3552 [00:34<1:19:38, 1.47s/it]
2021-02-15 17:56:58
 10% |  348/3552 [00:55<04:18, 12.42it/s]
2021-02-15 17:57:13
 11% |  374/3552 [01:10<43:35, 1.22it/s]
2021-02-15 17:57:24
 11% |  382/3552 [01:20<1:14:13, 1.40s/it]
2021-02-15 17:57:32
 11% |  400/3552 [01:28<32:54, 1.60it/s]
2021-02-15 17:58:05
 12% |  424/3552 [02:01<2:21:03, 2.71s/it]
2021-02-15 17:58:41
 12% |  427/3552 [02:37<4:47:51, 5.53s/it]
2021-02-15 17:59:30
 12% |  430/3552 [03:27<7:41:56, 8.88s/it]
2021-02-15 17:59:31
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2021-02-15 17:59:42
 12% |  440/3552 [03:39<3:16:16, 3.78s/it]
2021-02-15 18:00:00
 13% |  446/3552 [03:56<3:13:10, 3.73s/it]
2021-02-15 18:02:58
 13% |  447/3552 [06:54<26:35:29, 30.83s/it]
2021-02-15 18:03:14
 13% |  453/3552 [07:10<11:10:22, 12.98s/it]

You can see from the timestamps that the loading speed drops dramatically and effectively stalls after about 450 images.

Environment

================================ Printing MONAI config…

MONAI version: 0.4.0
Numpy version: 1.19.2
Pytorch version: 1.7.1
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 0563a4467fa602feca92d91c7f47261868d171a1

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 3.2.1
scikit-image version: NOT INSTALLED or UNKNOWN VERSION.
Pillow version: 8.1.0
Tensorboard version: 2.3.0
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.8.2
ITK version: NOT INSTALLED or UNKNOWN VERSION.
tqdm version: 4.56.0
lmdb version: 1.1.1
psutil version: 5.8.0

For details about installing the optional dependencies, please visit: https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

================================ Printing system config…

System: Linux
Linux version: Ubuntu 20.04.1 LTS
Platform: Linux-5.8.0-40-generic-x86_64-with-debian-bullseye-sid
Processor: x86_64
Machine: x86_64
Python version: 3.7.9
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 64
Num logical CPUs: 128
Num usable CPUs: 128
CPU usage (%): [20.0, 0.0, 0.0, 10.0, 0.0, 11.1, 33.3, 20.0, 20.0, 0.0, 10.0, 0.0, 0.0, 10.0, 100.0, 100.0, 0.0, 11.1, 11.1, 0.0, 11.1, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 11.1, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0, 11.1, 0.0, 11.1, 0.0, 11.1, 11.1, 20.0, 0.0, 11.1, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 11.1, 11.1, 11.1, 0.0, 0.0, 0.0, 0.0, 11.1, 0.0, 0.0, 0.0, 0.0, 0.0, 22.2, 11.1, 0.0, 0.0, 11.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 20.0, 0.0, 0.0, 11.1, 0.0, 30.0, 33.3, 0.0, 10.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 0.0, 0.0, 11.1, 11.1, 0.0, 10.0, 11.1, 11.1, 11.1, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0]
CPU freq. (MHz): 2360
Load avg. in last 1, 5, 15 mins (%): [24.8, 20.4, 20.4]
Disk usage (%): 18.1
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 251.6
Available memory (GB): 153.4
Used memory (GB): 91.8

================================ Printing GPU config…

Num GPUs: 4
Has CUDA: True
CUDA version: 11.0
cuDNN enabled: True
cuDNN version: 8005
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'compute_37']
Info for GPU: 3
Name: Quadro RTX 8000
Is integrated: False
Is multi GPU board: False
Multi processor count: 72
Total memory (GB): 47.5
Cached memory (GB): 0.0
Allocated memory (GB): 0.0
CUDA capability (maj.min): 7.5

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
ndalton12 commented, Feb 18, 2021

It appears that the problem has gone away (?) after a system reboot, but I added the dataset partitioning for good measure. Thanks @Nic-Ma. On a related note, the cache loading is sometimes very fast and other times very slow. Any idea why that might be?

I was also having an issue with GPUs deadlocking/hanging at 100% usage at seemingly random points partway through training. Maybe the partitioning will help there too. Anyway, since the problem has not reappeared, I will close the issue. Hopefully it does not come back 🤞

0 reactions
Nic-Ma commented, Feb 18, 2021

Hi @ndalton12 ,

For distributed data parallel, to avoid duplicated caching on every rank, we usually partition the dataset before caching. You can check this example for more details: https://github.com/Project-MONAI/tutorials/blob/master/acceleration/distributed_training/unet_training_smartcache.py#L120 Note that this avoids duplicated caching in memory, but it does not do a global shuffle every epoch; each rank only shuffles its own partition.

Thanks.
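
For reference, here is a sketch of the partitioning pattern described above, adapted from the linked tutorial. It assumes torch.distributed has already been initialized (for example by the launcher) and reuses the hypothetical train_files and train_transforms names from the reproduction snippet; each rank caches only its own shard instead of the full dataset.

    # Sketch of per-rank partitioning before caching, adapted from the linked
    # tutorial; assumes torch.distributed is already initialized and that
    # train_files / train_transforms are defined as in the earlier snippet.
    import torch.distributed as dist

    from monai.data import CacheDataset, partition_dataset

    # Give each rank a non-overlapping shard of the data dicts; the fixed
    # default seed means every rank computes the same split.
    data_part = partition_dataset(
        data=train_files,
        num_partitions=dist.get_world_size(),
        shuffle=True,
        even_divisible=True,
    )[dist.get_rank()]

    # Each rank caches only its own shard, so memory use and cache-build time
    # are divided across ranks instead of being duplicated on every rank.
    train_ds = CacheDataset(
        data=data_part,
        transform=train_transforms,
        cache_rate=1.0,
        num_workers=8,
    )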
