
Out of memory / memory leak debugging


Hi,

This is less of a clear bug report and more a write-up of some debugging I recently did around weird memory-leak-like issues while running code using dask. I hope this will save someone a bit of time in the future.

I’m trying to use dask to run some simple code in parallel, as a better multiprocessing. Roughly this:

from dask import delayed
from dask.distributed import Client

client = Client()

def fun(x):
    return x + 1
tasks = [delayed(fun)(i) for i in range(1000)]
futs = client.compute(tasks)

In practice fun is a bit more complex: it reads data from S3, does a bit of computation and writes results back to S3.
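For illustration, a hypothetical version of that real task might look roughly like this (the bucket, key layout and the computation itself are made up; only the general read-compute-write shape comes from the description above):

import s3fs

def fun(x):
    # Hypothetical sketch: read an input object, do some work, write a result.
    fs = s3fs.S3FileSystem()
    with fs.open(f"some-bucket/input/part-{x}", "rb") as f:
        data = f.read()
    result = data.upper()  # stand-in for the ~30 seconds of real computation
    with fs.open(f"some-bucket/output/part-{x}", "wb") as f:
        f.write(result)
    return x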

When running the real example in parallel I was seeing really slow scheduling, and workers slowly ran out of memory. It seemed to scale roughly with the number of tasks: each task takes around 30 seconds on a single core, and when I was running 10 of them it all worked perfectly. 500 still worked, 1000 was borderline, and 10000 certainly didn’t.

My workers died with various exceptions related to memory usage, e.g.

distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting

After a bit of debugging this seems to be caused by a helper class that got passed into fun. A minimal example that breaks looks like this:

import s3fs
from dask import delayed
from dask.distributed import Client, progress

client = Client()

class S3FsWrapper(object):
    def __init__(self):
        self.fs = s3fs.S3FileSystem()

    def get_s3fs(self):
        return self.fs

fs = S3FsWrapper()

def fun(fs, x):
    # would do something with fs here, but not necessary to trigger OOM
    return x + 1

tasks = [delayed(fun)(fs, i) for i in range(1000)]
futs = client.compute(tasks)
progress(futs)

Running this will take ages, and depending on how much RAM you have it will most likely crash.

So I had a look at where all this memory goes with pympler.muppy:

def debug_mem():
    # Summarize all live objects in the worker process with pympler.
    from pympler import summary, muppy
    all_objects = muppy.get_objects()
    s = summary.summarize(all_objects)
    return s

# client.run executes the function on every worker and returns a dict
# keyed by worker address.
s = client.run(debug_mem)

from pympler import summary, muppy
summary.print_(list(s.values())[0])  # print the summary from the first worker
                                                  types |   # objects |   total size
======================================================= | =========== | ============
                        <class 'collections.OrderedDict |      372201 |    163.90 MB
                                            <class 'str |      598212 |     48.18 MB
                                           <class 'dict |       90573 |     23.05 MB
                                           <class 'list |       95265 |      7.50 MB
                             <class '_io.BufferedWriter |           3 |      4.25 MB
                                           <class 'code |       25294 |      3.49 MB
                                           <class 'type |        3317 |      3.39 MB
                        <class 'botocore.hooks.NodeList |       19000 |      1.45 MB
                                          <class 'tuple |       22560 |      1.42 MB
                                            <class 'set |        2921 |      1.28 MB
                                           <class 'cell |       22473 |      1.03 MB
  <class 'botocore.docs.docstring.ClientMethodDocstring |        7700 |    789.55 KB
                                        <class 'weakref |        5410 |    422.66 KB
                  <class 'botocore.model.OperationModel |        7700 |    421.09 KB
                                            <class 'int |        8753 |    261.06 KB

It looks like every task instance has loaded its own copy of botocore. All the strings contain AWS API descriptions, and I suspect the OrderedDicts are similar.
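One way to sanity-check that theory outside of dask (a rough sketch; it assumes the fs wrapper from the example above is in scope and only approximates what happens on the workers) is to round-trip the wrapper through cloudpickle a few hundred times, keep the copies alive, and look at whether the botocore/OrderedDict totals in a fresh pympler summary grow with the number of copies:

import cloudpickle
from pympler import muppy, summary

# Simulate many deserialized task copies of the wrapper, as the workers would see.
payload = cloudpickle.dumps(fs)
copies = [cloudpickle.loads(payload) for _ in range(200)]

# If each copy drags in its own botocore state, the OrderedDict/str totals
# here should be far larger than with a single wrapper.
summary.print_(summary.summarize(muppy.get_objects()))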

So this is how far I’ve gotten. Runnable notebook is at https://github.com/ah-/notebooks/blob/master/dask_oom.ipynb.

I have some ideas about what exactly is going on underneath, but I’d be grateful for a clear explanation, and maybe some hints on how to avoid this. I suspect this isn’t actually a dask bug but a side effect of how data is serialised and passed around.

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 14 (11 by maintainers)

Top GitHub Comments

3 reactions
mrocklin commented, Mar 28, 2018

I recommend raising an issue upstream on s3fs noting that creating many S3FileSystems (or at least deserializing them) seems to make many botocore objects, and asking if there is a place where you can help to correct the issue.

0 reactions
mrocklin commented, Mar 28, 2018

OK, I can reproduce the issue. Some details:

This helps when testing on larger systems:

cluster = LocalCluster(..., memory_limit='2GB') 
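For reference, a minimal sketch of that setup (the worker and thread counts are assumptions for illustration; only memory_limit='2GB' comes from the line above):

from dask.distributed import LocalCluster, Client

# Cap each worker at 2GB so the problem shows up even on machines with lots of RAM.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit='2GB')
client = Client(cluster)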

When I watch the diagnostic dashboard I notice that memory jumps up quickly before any of the computations start. I suspect this means that the memory cost isn’t in the results on the workers, it’s in the deserialized versions of the tasks themselves (the many Python functions). Generally we don’t have any controls on data like this that we expect to be small. I am not surprised to learn that Dask crashes here.

One thing that may help here would be if we were to do a bit of caching on deserialization. “Hey, I’ve seen this huge string of bytes recently, it turned into this function, I’ll just return that immediately rather than deserialize it again.” But this will likely have complications of its own.
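For illustration only, the kind of memoised deserialization being described might look roughly like this (this is not something distributed actually does; note that a cache like this hands the same deserialized object to every task, which is exactly the sort of complication mentioned):

import functools
import cloudpickle

@functools.lru_cache(maxsize=128)
def cached_loads(payload: bytes):
    # Identical byte strings come back as the same (shared, mutable) object.
    return cloudpickle.loads(payload)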

I’m tempted to say “just don’t send hundreds of thousands of tasks that close over non-trivial data”.

%time len(cloudpickle.dumps(fs))
CPU times: user 0 ns, sys: 4.06 ms, total: 4.06 ms
Wall time: 3.32 ms
Out[10]:
911

This isn’t that big when serialized, but the serialization time is non-trivial and I wouldn’t be surprised if it’s much bigger once in memory.
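One pattern that follows from that advice (a sketch only, reusing client, fun and fs from the minimal example above; not verified to fully cure the botocore growth in this particular case) is to ship the wrapper to the workers once with scatter and pass the resulting future into each task, so it is deserialized once per worker rather than once per task:

from dask import delayed
from dask.distributed import progress

# client, fun and fs are the objects from the minimal example above.
fs_future = client.scatter(fs, broadcast=True)  # send the wrapper to each worker once
tasks = [delayed(fun)(fs_future, i) for i in range(1000)]
futs = client.compute(tasks)
progress(futs)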
