
Hi,

when the example below is run, RAM usage grows without bound:

import torch, torch.utils.data
import nonechucks

class DummyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 1_000_000

    def __getitem__(self, idx):
        return 666

dataset = nonechucks.SafeDataset(DummyDataset())

for _ in torch.utils.data.DataLoader(dataset):
    pass

Notes:

  • Here the increase is quite slow; for a rapid demonstration, replace 666 with torch.empty(10_000) (be careful to kill the process in time, before you run out of memory!).
  • There is no problem without SafeDataset.
  • Without torch.utils.data.DataLoader, the leak is still present, though at a smaller scale: around 1 MB of RAM is lost per 30,000-40,000 __getitem__ calls.
  • PyTorch 1.0.1, nonechucks 0.3.1.
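The growth pattern is consistent with every __getitem__ result being memoized in an unbounded per-call cache. The following is a minimal sketch of that pattern, not nonechucks' actual code (the decorator and cache names are assumptions for illustration):

```python
class memoize:
    """Sketch of an unbounded memoizing decorator."""
    def __init__(self, func):
        self.func = func
        self.cache = {}  # entries are never evicted

    def __call__(self, *args):
        # Every distinct argument tuple adds a cache entry that
        # stays alive for the lifetime of the decorated function.
        if args not in self.cache:
            self.cache[args] = self.func(*args)
        return self.cache[args]

@memoize
def getitem(idx):
    return 666

# Iterating over many distinct indices retains one entry per index:
for i in range(100_000):
    getitem(i)

print(len(getitem.cache))  # 100000
```

With a dataset of a million items, and especially with tensor-valued items, such a cache grows until the process is killed, which matches the observed behavior.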

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

5 reactions
timonbimon commented, Sep 11, 2019

Has this been resolved? 😃 Nonechucks looks pretty useful, but it’s very normal for us to have datasets that are much larger than the RAM, so with the memory leak this would be a no-go.

2 reactions
aronhoff commented, Jul 20, 2019

I tried adding a bool attribute to the memoize class that would make its __call__ skip the lookup. The idea is that you could use it as a property of the method, e.g. self.__getitem__.memoize = False. (c57e70ce20668415af77ce8eb5120ab35e6fb840)

Unfortunately this does not work with multiprocessing. __getitem__ is not in the __dict__ of an object, so pickling does not save or restore it and its attributes. Putting it into __dict__ does not solve this.
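A small demonstration of the lookup rule behind this: methods live in the class's __dict__, not the instance's, and default pickling only round-trips the instance __dict__, so nothing attached to the method itself survives (DS is a hypothetical stand-in class):

```python
import pickle

class DS:
    def __getitem__(self, idx):
        return idx

ds = DS()

# The method is defined on the class, so the instance __dict__ is empty:
print(ds.__dict__)                    # {}
print('__getitem__' in DS.__dict__)   # True

# Default pickling copies only the instance __dict__, so any state
# hung off the method object would not be saved or restored:
restored = pickle.loads(pickle.dumps(ds))
print(restored.__dict__)              # {}
```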

I do not currently have more time to pursue this. A solution could be a custom __setstate__ in SafeDataset, or perhaps a custom metaclass for it, or doing it directly in the __getitem__ function. Either way, it seems to require more entanglement between memoize and the owner object.

Perhaps the concerns should be separated completely: a MemoizedDataset that passes through the values from another dataset (potentially a SafeDataset) while memoizing them.
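Such a separation could look like the sketch below. This is a hypothetical wrapper, not part of nonechucks; it omits the torch.utils.data.Dataset base class for brevity and works with anything indexable:

```python
class MemoizedDataset:
    """Hypothetical wrapper that caches items from an inner dataset.

    Keeping memoization out of SafeDataset means callers opt in to
    caching (and its memory cost) explicitly.
    """
    def __init__(self, dataset):
        self.dataset = dataset
        self.cache = {}

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if idx not in self.cache:
            self.cache[idx] = self.dataset[idx]
        return self.cache[idx]

# Usage with any indexable dataset-like object:
base = [x * x for x in range(5)]
ds = MemoizedDataset(base)
print(ds[3], ds[3])   # second access is served from the cache
print(len(ds.cache))
```

Note that the cache here is still unbounded, so this only makes sense for datasets whose items collectively fit in RAM.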

Keep in mind that the cache dicts will be different instances across processes, so you would be replicating your dataset's cache in each subprocess. This may not be what you want, but for small datasets it may not be worth the effort of dealing with read-write shared memory.

And with regards to the dataset, ImageNet would optimistically take around 200 GB 😃
