
Rethinking the Queue class to get full GPU utilization


This is a really nice framework, but a serious issue for me is the lack of GPU utilization. This occurs even with only a simple ZNormalization and a left-right flip of the data as augmentation, and results in the following GPU utilization:

[Screenshot from 2020-12-22 12-12-31: GPU utilization graph with periodic drops to 0%]

This is a training run with 5 subjects, sampling 40 patches per volume, and batch size 8. After every 25 iterations (i.e., 25 × 8 = 200 patches, exactly the 5 × 40 patches that one queue fill provides), there is a gap during which GPU utilization drops to 0: training blocks while the queue is refilled.

What I did was build a custom Queue class, with the aim of full GPU utilization. The result is as follows:

[Screenshot from 2020-12-22 12-05-10: GPU utilization at a constant ~100%]

As you can see, this achieves ~100% GPU utilization without gaps (the 0% utilization is from before the start of training).

I tried to reuse the existing PyTorch data functionality as much as possible. BufferedShuffleDataset is not yet in the current release, but it appears to be slated for the next one. The idea behind it is the same as shuffle() in TensorFlow's tf.data.

Here is the code I wrote:

import random
from itertools import islice
from typing import Iterator, List

import torch
torch.multiprocessing.set_sharing_strategy('file_system')
from torch.utils.data import DataLoader, IterableDataset
from torch.utils.data.dataset import T_co
from torchio.data import PatchSampler



# https://github.com/pytorch/pytorch/commit/96540e918c4ca3f0a03866b9d281c34c65bd76a4#diff-425b66e1ff01d191679c386258a7156dfb5aacd64a8e0947b24fbdebcbee8529
class BufferedShuffleDataset(IterableDataset[T_co]):
    r"""Dataset shuffled from the original dataset.
    This class is useful to shuffle an existing instance of an IterableDataset.
    The buffer with `buffer_size` is filled with the items from the dataset first. Then,
    each item will be yielded from the buffer by reservoir sampling via iterator.
    `buffer_size` is required to be larger than 0. For `buffer_size == 1`, the
    dataset is not shuffled. In order to fully shuffle the whole dataset, `buffer_size`
    is required to be greater than or equal to the size of dataset.
    When it is used with :class:`~torch.utils.data.DataLoader`, each item in the
    dataset will be yielded from the :class:`~torch.utils.data.DataLoader` iterator.
    And, the method to set up a random seed is different based on :attr:`num_workers`.
    For single-process mode (:attr:`num_workers == 0`), the random seed is required to
    be set before the :class:`~torch.utils.data.DataLoader` in the main process.
        >>> ds = BufferedShuffleDataset(dataset)
        >>> random.seed(...)
        >>> print(list(torch.utils.data.DataLoader(ds, num_workers=0)))
    For multi-process mode (:attr:`num_workers > 0`), the random seed is set by a callable
    function in each worker.
        >>> ds = BufferedShuffleDataset(dataset)
        >>> def init_fn(worker_id):
        ...     random.seed(...)
        >>> print(list(torch.utils.data.DataLoader(ds, ..., num_workers=n, worker_init_fn=init_fn)))
    Arguments:
        dataset (IterableDataset): The original IterableDataset.
        buffer_size (int): The buffer size for shuffling.
    """
    dataset: IterableDataset[T_co]
    buffer_size: int

    def __init__(self, dataset: IterableDataset[T_co], buffer_size: int) -> None:
        super(BufferedShuffleDataset, self).__init__()
        assert buffer_size > 0, "buffer_size should be larger than 0"
        self.dataset = dataset
        self.buffer_size = buffer_size

    def __iter__(self) -> Iterator[T_co]:
        buf: List[T_co] = []
        for x in self.dataset:
            if len(buf) == self.buffer_size:
                idx = random.randint(0, self.buffer_size - 1)
                yield buf[idx]
                buf[idx] = x
            else:
                buf.append(x)
        random.shuffle(buf)
        while buf:
            yield buf.pop()


class PatchesDataset(IterableDataset):
    def __init__(self, subjects_dataset, sampler, samples_per_volume):
        self.subjects_dataset = subjects_dataset
        self.sampler = sampler
        self.samples_per_volume = samples_per_volume

    def __iter__(self):
        while True:
            idx = random.randint(0, len(self.subjects_dataset) - 1)
            sample = self.subjects_dataset[idx]
            iterable = self.sampler(sample)
            patches = list(islice(iterable, self.samples_per_volume))

            yield patches


class Queue(IterableDataset):
    def __init__(
            self,
            subjects_dataset,  # a tio.SubjectsDataset; wrapped in a PatchesDataset below
            max_length: int,
            samples_per_volume: int,
            sampler: PatchSampler,
            num_workers: int = 0,
            # the arguments below are kept for signature compatibility with
            # tio.Queue but are not used here; shuffling is delegated to
            # BufferedShuffleDataset
            shuffle_subjects: bool = True,
            shuffle_patches: bool = True,
            verbose: bool = False,
    ):
        self.dataset = PatchesDataset(subjects_dataset, sampler, samples_per_volume)
        self.max_length = max_length

        self.loader = DataLoader(self.dataset,
                                 batch_size=None,
                                 num_workers=num_workers,
                                 # persistent_workers=True raises a ValueError
                                 # when num_workers == 0, so enable it only
                                 # when worker processes are actually used
                                 persistent_workers=num_workers > 0)

        self.buffer = []  # unused; buffering/shuffling happens in BufferedShuffleDataset

    def __iter__(self):
        # Basically this is an unbatch operation
        for patches_list in self.loader:
            for patch in patches_list:
                yield patch
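
As the docstring above notes, with num_workers > 0 each worker needs its own seed for Python's random module, set via worker_init_fn. A minimal sketch of such a function (the base seed 42 and the offset scheme are my own arbitrary choice, not from PyTorch):

def seed_worker(worker_id: int) -> None:
    # give each DataLoader worker a distinct, reproducible seed for `random`,
    # which both PatchesDataset and BufferedShuffleDataset rely on
    random.seed(42 + worker_id)

# e.g. DataLoader(ds, num_workers=4, worker_init_fn=seed_worker)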

As you can see, I had to add:

import torch
torch.multiprocessing.set_sharing_strategy('file_system')

because otherwise I got this error: RuntimeError: received 0 items of ancdata

This may be specific to the system that I use, but it seems to be a fairly common problem; see: https://github.com/pytorch/pytorch/issues/973
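
For reference, another workaround sometimes suggested for this problem is to raise the process's open-file limit instead of switching the sharing strategy. A minimal sketch using the standard resource module (Linux only; whether it suffices depends on how many descriptors the workers actually need):

import resource

# raise the soft limit on open file descriptors to the hard limit; the default
# 'file_descriptor' sharing strategy can exhaust it, which shows up as
# "RuntimeError: received 0 items of ancdata"
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))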

Using this custom implementation, the old tio.Queue setup can be replaced as follows:

        # patches_queue = tio.Queue(
        #     self.dataset,
        #     max_length=self.queue_length,
        #     samples_per_volume=self.samples_per_volume,
        #     sampler=sampler,
        #     num_workers=self.num_workers,
        #     verbose=False
        # )

        # use the custom queue instead of the default one
        queue = Queue(self.dataset,
                      max_length=self.queue_length,
                      samples_per_volume=self.samples_per_volume,
                      sampler=sampler,
                      num_workers=self.num_workers,
                      verbose=False)
        patches_queue = BufferedShuffleDataset(queue, self.queue_length)

        patches_loader = DataLoader(patches_queue, batch_size=self.batch_size)
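
Iterating over patches_loader then works the same way as with tio.Queue. A minimal sketch of the consuming loop ('image' is a placeholder for whatever key the subjects actually contain):

for patches_batch in patches_loader:
    inputs = patches_batch['image'][tio.DATA]  # 'image' is a placeholder key
    # ... forward/backward pass on inputs ...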

What do you think of this? Could this replace the existing tio.Queue, or exist alongside it?

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 3
  • Comments: 17 (13 by maintainers)

Top GitHub Comments

2 reactions
dmus commented, Apr 22, 2021

To come back to this issue: the implementation now looks like this.

class BatchedPatchesDataset(IterableDataset):

    def __init__(self, subjects_datasets, weights, sampler, samples_per_volume):
        self.subjects_datasets = subjects_datasets
        self.weights = weights
        self.sampler = sampler
        self.samples_per_volume = samples_per_volume

    def __iter__(self):
        while True:
            sampled_dataset = random.choices(population=self.subjects_datasets, weights=self.weights)[0]
            idx = random.randint(0, len(sampled_dataset) - 1)
            sample = sampled_dataset[idx]
            iterable = self.sampler(sample)
            patches = list(islice(iterable, self.samples_per_volume))

            yield patches


class UnbatchDataset(IterableDataset):

    def __init__(
            self,
            dataset: Dataset,  # requires `from torch.utils.data import Dataset` in addition to the imports above
            num_workers: int = 0,
    ):
        self.loader = DataLoader(dataset,
                                 batch_size=None,
                                 num_workers=num_workers)

    def __iter__(self):
        # with batch_size=None the loader yields one patches list at a time;
        # flatten it so downstream consumers see individual patches
        for batch in self.loader:
            yield from batch

To use it:

    # This yields a dataset whose items are non-random "batches" of
    # 'samples_per_volume' patches, each batch drawn from a single subject
    to_patches = BatchedPatchesDataset(subjects_datasets=[dataset],
                                       weights=[1],  # only relevant when sampling from multiple subject datasets
                                       sampler=sampler,
                                       samples_per_volume=samples_per_volume)
    # Unbatch the batches
    patches_unbatched = UnbatchDataset(to_patches, num_workers)

    # Shuffle to get the patches in a random order
    queue = BufferedShuffleDataset(patches_unbatched, max_queue_length)

    patches_loader = DataLoader(queue, batch_size=batch_size)

    for i, patches_batch in enumerate(patches_loader):
        inputs = patches_batch['ct'][tio.DATA].numpy()
        targets = patches_batch['labels'][tio.DATA].numpy()
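
Because BatchedPatchesDataset picks the next subject dataset with random.choices, the weights argument makes it straightforward to mix several subject datasets. A hypothetical two-dataset example (dataset_a and dataset_b are placeholder names):

    # sample subjects from dataset_b three times as often as from dataset_a
    to_patches = BatchedPatchesDataset(subjects_datasets=[dataset_a, dataset_b],
                                       weights=[1, 3],
                                       sampler=sampler,
                                       samples_per_volume=samples_per_volume)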

For my use case this gives good GPU utilization and a big speed-up, but it would be good to know how it behaves in other use cases too. Shall I submit a pull request?

2 reactions
dmus commented, Dec 22, 2020

I will test and work it out a bit more and then submit a PR
