Out of memory / memory leak debugging
Hi,
This is less of a clear bug report and more a writeup of some debugging I recently did around weird memory-leak-like issues while running code using dask. I hope this will save someone a bit of time in the future.
I’m trying to use dask to run some simple code in parallel, as a better multiprocessing. Roughly this:
from dask import delayed

def fun(x):
    return x + 1

tasks = [delayed(fun)(i) for i in range(1000)]
futs = client.compute(tasks)  # client is a dask.distributed Client
In practice fun is a bit more complex: it reads data from s3, does a bit of computation and writes results back to s3.
When running the real example in parallel I was seeing really slow scheduling, and workers slowly ran out of memory. The problem seemed to scale roughly with the number of tasks: each task takes around 30 seconds on a single core, and when I ran 10 of them everything worked perfectly. 500 still worked, 1000 was borderline, and 10000 certainly didn’t.
My workers died with various exceptions related to memory usage, e.g.
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
After a bit of debugging this seems to be caused by a helper class that got passed into fun. A minimal example that breaks looks like this:
import s3fs
from dask import delayed
from dask.distributed import progress

class S3FsWrapper(object):
    def __init__(self):
        self.fs = s3fs.S3FileSystem()

    def get_s3fs(self):
        return self.fs

fs = S3FsWrapper()

def fun(fs, x):
    # would do something with fs here, but not necessary to trigger OOM
    return x + 1

tasks = [delayed(fun)(fs, i) for i in range(1000)]
futs = client.compute(tasks)
progress(futs)
Running this will take ages and, depending on how much RAM you have, will most likely crash.
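If you want to reproduce this without taking down your machine, it helps to cap the local workers first. This is just a sketch of a small memory-capped setup with made-up limits; the notebook linked below may configure the cluster differently:

from dask.distributed import Client, LocalCluster

# Small, memory-capped local cluster so the repro fails fast
# instead of swapping the whole machine.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="1GB")
client = Client(cluster)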
So I had a look at where all this memory goes with pympler.muppy:
def debug_mem():
    from pympler import summary, muppy
    all_objects = muppy.get_objects()
    s = summary.summarize(all_objects)
    return s

s = client.run(debug_mem)

from pympler import summary, muppy
summary.print_(list(s.values())[0])
types | # objects | total size
======================================================= | =========== | ============
<class 'collections.OrderedDict | 372201 | 163.90 MB
<class 'str | 598212 | 48.18 MB
<class 'dict | 90573 | 23.05 MB
<class 'list | 95265 | 7.50 MB
<class '_io.BufferedWriter | 3 | 4.25 MB
<class 'code | 25294 | 3.49 MB
<class 'type | 3317 | 3.39 MB
<class 'botocore.hooks.NodeList | 19000 | 1.45 MB
<class 'tuple | 22560 | 1.42 MB
<class 'set | 2921 | 1.28 MB
<class 'cell | 22473 | 1.03 MB
<class 'botocore.docs.docstring.ClientMethodDocstring | 7700 | 789.55 KB
<class 'weakref | 5410 | 422.66 KB
<class 'botocore.model.OperationModel | 7700 | 421.09 KB
<class 'int | 8753 | 261.06 KB
It looks like every task instance has loaded its own copy of botocore. All the strings contain AWS API descriptions, and I suspect the OrderedDicts are similar.
So this is how far I’ve gotten. Runnable notebook is at https://github.com/ah-/notebooks/blob/master/dask_oom.ipynb.
I have some ideas about what exactly is going on underneath, but I’d be grateful for a clear explanation, and maybe some hints on how to avoid this. I suspect this isn’t actually a dask bug but a side-effect of how data is serialised and passed around.
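One way to check the serialisation suspicion (a sketch of my own, assuming cloudpickle is installed, which dask uses to serialise functions and their closures, and reusing fun and fs from above): pickle a single task payload by hand and look at what comes back when it is loaded again.

import cloudpickle

# Roughly what gets shipped for a single task: the function plus its
# arguments, including the S3FsWrapper instance.
blob = cloudpickle.dumps((fun, fs, 0))
print(f"{len(blob)} bytes per task on the wire")

# Every loads() on a worker rebuilds the wrapper and its S3FileSystem,
# which seems to be where the per-task botocore objects come from.
restored_fun, restored_fs, _ = cloudpickle.loads(blob)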
Top GitHub Comments
I recommend raising an issue upstream on s3fs noting that creating many S3FileSystems (or at least deserializing them) seems to make many botocore objects, and asking if there is a place where you can help to correct the issue.
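One way to sidestep the problem in the meantime (a sketch, not something spelled out in the thread; it reuses delayed and client from above): don’t ship the wrapper with every task at all, and construct the S3FileSystem inside the task instead.

import s3fs

def fun(x):
    # Construct the filesystem inside the task: only the tasks currently
    # running hold an instance, instead of every queued task carrying its
    # own deserialised copy of the wrapper.
    fs = s3fs.S3FileSystem()
    # ... read from / write results back to s3 with fs here ...
    return x + 1

tasks = [delayed(fun)(i) for i in range(1000)]
futs = client.compute(tasks)

If constructing the client per task turns out to be too slow, caching one instance per worker process (for example via a module-level cache) gets the cost back down to roughly once per worker.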
OK, I can reproduce the issue. Some details:
- This helps when testing on larger systems.
- When I watch the diagnostic dashboard I notice that memory jumps up quickly before any of the computations start. I suspect that this means the memory cost isn’t in the results on the workers, it’s in the deserialized versions of the tasks themselves (the many Python functions). Generally we don’t have any controls on data like this that we expect to be small. I am not surprised to learn that Dask crashes here.
- One thing that may help here would be to do a bit of caching on deserialization: “Hey, I’ve seen this huge string of bytes recently, it turned into this function, I’ll just return that immediately rather than deserialize it again.” But this will likely have complications of its own.
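A toy sketch of what such a deserialisation cache could look like (just the idea from the comment above, not anything dask actually does):

import hashlib
import cloudpickle

_seen = {}

def cached_loads(blob):
    # Return the previously deserialised object if we have seen these exact
    # bytes before, instead of paying the deserialisation cost once per task.
    key = hashlib.sha1(blob).digest()
    if key not in _seen:
        _seen[key] = cloudpickle.loads(blob)
    return _seen[key]

The obvious complications are that the cached object is then shared between tasks, which is only safe if nothing mutates it, and that an unbounded cache eventually needs eviction.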
I’m tempted to say “just don’t send hundreds of thousands of tasks that close over non-trivial data”.
This isn’t that big serialized, but the serialization time is non-trivial and I wouldn’t be surprised if it’s much bigger when in memory.
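For what it’s worth, the usual way to follow that advice (a sketch, not from the thread; it reuses fun, fs, delayed and client from above) is to send the shared object to the cluster once with Client.scatter and pass the resulting future into the tasks, rather than closing over the object itself:

# Ship the wrapper to the workers once, then reuse the same copy everywhere.
[fs_future] = client.scatter([fs], broadcast=True)

tasks = [delayed(fun)(fs_future, i) for i in range(1000)]
futs = client.compute(tasks)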