
Hi,

when the example below is run, RAM usage grows without bound:

import torch, torch.utils.data
import nonechucks

class DummyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 1_000_000

    def __getitem__(self, idx):
        return 666

dataset = nonechucks.SafeDataset(DummyDataset())

for _ in torch.utils.data.DataLoader(dataset):
    pass

Notes:

  • Here the increase is quite slow; for a rapid demonstration, replace 666 with torch.empty(10_000) (be careful to kill the process in time, before you run out of memory!).
  • There is no problem without SafeDataset.
  • Without torch.utils.data.DataLoader, the leak is still present, though at a smaller scale: around 1 MB of RAM is lost per 30,000-40,000 __getitem__ calls.
  • PyTorch 1.0.1, nonechucks 0.3.1.
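The growth pattern is consistent with every __getitem__ result being memoized in an unbounded per-call cache. The following is a minimal sketch of that pattern, not nonechucks' actual code (the decorator and cache names are assumptions for illustration):

```python
class memoize:
    """Sketch of an unbounded memoizing decorator."""
    def __init__(self, func):
        self.func = func
        self.cache = {}  # entries are never evicted

    def __call__(self, *args):
        # Every distinct argument tuple adds a cache entry that
        # stays alive for the lifetime of the decorated function.
        if args not in self.cache:
            self.cache[args] = self.func(*args)
        return self.cache[args]

@memoize
def getitem(idx):
    return 666

# Iterating over many distinct indices retains one entry per index:
for i in range(100_000):
    getitem(i)

print(len(getitem.cache))  # 100000
```

With a dataset of a million items, and especially with tensor-valued items, such a cache grows until the process is killed, which matches the observed behavior.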

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

5 reactions
timonbimon commented, Sep 11, 2019

Has this been resolved? 😃 Nonechucks looks pretty useful, but it’s very normal for us to have datasets that are much larger than the RAM, so with the memory leak this would be a no-go.

2 reactions
aronhoff commented, Jul 20, 2019

I tried adding a bool attribute to the memoize class that would make its __call__ skip the lookup. The idea is that you could use it as a property of the method, e.g. self.__getitem__.memoize = False. (c57e70ce20668415af77ce8eb5120ab35e6fb840)

Unfortunately this does not work with multiprocessing. __getitem__ is not in the __dict__ of an object, so pickling does not save or restore it and its attributes. Putting it into __dict__ does not solve this.
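A small demonstration of the lookup rule behind this: methods live in the class's __dict__, not the instance's, and default pickling only round-trips the instance __dict__, so nothing attached to the method itself survives (DS is a hypothetical stand-in class):

```python
import pickle

class DS:
    def __getitem__(self, idx):
        return idx

ds = DS()

# The method is defined on the class, so the instance __dict__ is empty:
print(ds.__dict__)                    # {}
print('__getitem__' in DS.__dict__)   # True

# Default pickling copies only the instance __dict__, so any state
# hung off the method object would not be saved or restored:
restored = pickle.loads(pickle.dumps(ds))
print(restored.__dict__)              # {}
```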

I do not currently have more time to pursue this. A solution could be a custom __setstate__ in SafeDataset, or perhaps a custom metaclass for it, or doing it directly in the __getitem__ function. Either way, it seems to require more entanglement between memoize and the owner object.

Perhaps the concerns should be separated completely: a MemoizedDataset that passes through the values from another dataset (potentially a SafeDataset) while memoizing them.
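Such a separation could look like the sketch below. This is a hypothetical wrapper, not part of nonechucks; it omits the torch.utils.data.Dataset base class for brevity and works with anything indexable:

```python
class MemoizedDataset:
    """Hypothetical wrapper that caches items from an inner dataset.

    Keeping memoization out of SafeDataset means callers opt in to
    caching (and its memory cost) explicitly.
    """
    def __init__(self, dataset):
        self.dataset = dataset
        self.cache = {}

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if idx not in self.cache:
            self.cache[idx] = self.dataset[idx]
        return self.cache[idx]

# Usage with any indexable dataset-like object:
base = [x * x for x in range(5)]
ds = MemoizedDataset(base)
print(ds[3], ds[3])   # second access is served from the cache
print(len(ds.cache))
```

Note that the cache here is still unbounded, so this only makes sense for datasets whose items collectively fit in RAM.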

Keep in mind that the cache dicts will be different instances across processes, so you would be replicating your dataset's cache in each subprocess. This may not be what you want, but for small datasets it may not be worth the effort of dealing with read-write shared memory.

And with regards to the dataset, ImageNet would optimistically take around 200 GB 😃
