
memory leak with min/max aggregation of huge array

See original GitHub issue

I’m running a single-machine cluster and trying to load and process a movie larger than memory. However, I seem to have a memory leak at various steps, especially when doing aggregations, so I created the following simple script to track it down. Surprisingly, even the first step in my workflow, dask.array.image.imread, seems to be enough to trigger the leak:

import os
import dask
import numpy as np
from dask.distributed import LocalCluster, Client, fire_and_forget
from dask.array.image import imread
cluster = LocalCluster(diagnostics_port=8989, memory_limit="200MB")
client = Client(cluster)
/opt/miniconda3/envs/dask/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/opt/miniconda3/envs/dask/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
array = dask.array.zeros((20000, 500, 800), chunks=(1, -1, -1))
array
dask.array<zeros, shape=(20000, 500, 800), dtype=float64, chunksize=(1, 500, 800)>
# this is perfectly fine even though the array is larger than my real movie
arr_sum = client.compute(array.sum())
arr_sum.result()
0.0
#the folder contains ~18000 tiff files, but to isolate the issue I'm not actually reading them
dpath = "/home/phild/Documents/test_data/"
def dummy_read(im):
    return np.zeros((480, 752))
array = imread(os.path.join(dpath, "*.tiff"), dummy_read)
array
dask.array<imread, shape=(17991, 480, 752), dtype=float64, chunksize=(1, 480, 752)>
arr_sum = client.compute(array.sum())
arr_sum.result()
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker process 12465 was killed by signal 15
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker process 12463 was killed by signal 15
distributed.nanny - WARNING - Worker process 12484 was killed by signal 15
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker process 12474 was killed by signal 15
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker process 12479 was killed by signal 15
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker process 12477 was killed by signal 15
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 12468 was killed by signal 15
distributed.nanny - WARNING - Worker process 12472 was killed by signal 15
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker



---------------------------------------------------------------------------

KilledWorker                              Traceback (most recent call last)

<ipython-input-5-12b03eff355b> in <module>
      1 arr_sum = client.compute(array.sum())
----> 2 arr_sum.result()


/opt/miniconda3/envs/dask/lib/python3.7/site-packages/distributed/client.py in result(self, timeout)
    193                                   raiseit=False)
    194         if self.status == 'error':
--> 195             six.reraise(*result)
    196         elif self.status == 'cancelled':
    197             raise result


/opt/miniconda3/envs/dask/lib/python3.7/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None


KilledWorker: ("('imread-sum-22b1dc8c8dbc99d865bdc52557ca4d52', 5415, 0, 0)", 'tcp://127.0.0.1:44369')

The last sum produces a memory leak and kills all my workers, even though the array is smaller than the one created by dask.array.zeros, and I have something like the following in my worker logs:

distributed.worker - DEBUG - Calling gc.collect(). 3.435s elapsed since previous call.

distributed.worker - DEBUG - gc.collect() took 0.089s

distributed.worker - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 161.94 MB -- Worker memory limit: 200.00 MB

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 161.94 MB -- Worker memory limit: 200.00 MB

distributed.worker - DEBUG - Heartbeat skipped: channel busy

distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:32873

distributed.worker - DEBUG - gc.collect() lasts 0.089s but only 0.137s elapsed since last call: throttling.

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 162.17 MB -- Worker memory limit: 200.00 MB

distributed.worker - DEBUG - gc.collect() lasts 0.089s but only 0.144s elapsed since last call: throttling.

I understand that 200MB is probably not a reasonable memory limit for data of this size, but I would expect dask to handle even such extreme cases by spilling everything to disk (in fact, I think that is what happened when the array was created with dask.array.zeros), instead of complaining about a memory leak.

If this is expected behavior, however, I’d like to understand the rule of thumb for estimating the minimum memory required for such tasks.
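For what it’s worth, the kind of back-of-the-envelope estimate I have in mind looks like this (just a sketch: the thread count and import overhead below are guesses, and scheduler-side task metadata is not included):

import numpy as np

# Rough per-worker memory estimate for this workload. Assumptions, not
# measurements: each worker thread holds roughly one input chunk plus one
# partial result, on top of a fixed interpreter-plus-imports baseline.
chunk_shape = (1, 500, 800)                        # one frame per chunk
chunk_bytes = np.prod(chunk_shape) * np.dtype("float64").itemsize
threads_per_worker = 4                             # assumed
baseline_mb = 120                                  # assumed imports overhead

est_mb = baseline_mb + threads_per_worker * 2 * chunk_bytes / 1e6
print(f"one chunk ~ {chunk_bytes / 1e6:.1f} MB; rough per-worker need ~ {est_mb:.0f} MB")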

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 15 (9 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Nov 28, 2018

Beating Dask to death is a productive and educational experience. I tried your example and got the same result (thanks for the simple example). I then tried it with a larger chunk size of (100, -1, -1) (you have 200k tasks here) and it ran fine. In this case I suspect that the metadata needed to store all of those tasks was getting up into your 1 GB memory range. Generally speaking, I recommend giving workers more than 1 GB of memory, and also having fewer than 200k tasks.
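For concreteness, a minimal sketch of what that larger chunk size looks like on the synthetic array from the quoted example (the rechunk line is one alternative, not something tested in this thread):

import dask.array as da
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(memory_limit="1GB")
client = Client(cluster)

# 100 frames per chunk instead of 1 cuts the graph from ~200k tasks to ~2k,
# which keeps the scheduler- and worker-side task metadata small.
array = da.zeros((200000, 500, 800), chunks=(100, -1, -1))

# An existing finely-chunked array can also be rechunked after the fact,
# at the cost of some shuffling work:
# array = array.rechunk((100, -1, -1))

arr_sum = client.compute(array.sum())
print(arr_sum.result())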

On Tue, Nov 27, 2018 at 3:59 PM phildong notifications@github.com wrote:

sorry I’m a bit confused – does the warning just mean “we are low on memory”, which is perfectly fine, or does it mean “we are low on memory AND something else that is out of dask’s control is going on”, which is very problematic?

If the latter is true, here is a minimal example that produces the warning on my machine, and I don’t see why there would be any data that is not tracked by dask and cannot be spilled to disk:

import dask
import dask.array as da
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(diagnostics_port=8989, memory_limit="1GB")
client = Client(cluster)
array = da.zeros((200000, 500, 800), chunks=(1, -1, -1))
client.persist(array)

Sorry if I seem to be intentionally beating dask to death, but this warning is the only thing that looks abnormal during my computation. My workflow inevitably involves holding huge video arrays on the workers, and I really want them to be able to hold arrays of arbitrary size, potentially spilling to disk. Any other suggestions on how to do or debug this would be super helpful!


1 reaction
mrocklin commented, Nov 27, 2018

It gives you a warning about a leak, or just that it’s low on memory? You’re asking for the entire array to be in memory at once, so it will start to push data onto disk, and warn you that it’s doing so.
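The thresholds behind those warnings are configurable fractions of memory_limit; here is a minimal sketch of setting them explicitly, assuming a recent distributed release where they live under distributed.worker.memory (the values shown are the library defaults, and older releases used different key names):

import dask
from dask.distributed import LocalCluster, Client

# Fractions of memory_limit at which a worker starts spilling data to disk,
# pauses new tasks, and is finally restarted by the nanny. The 0.80 and 0.95
# values correspond to the "Pausing worker" and "exceeded 95% memory budget"
# messages in the logs above.
dask.config.set({
    "distributed.worker.memory.target": 0.60,
    "distributed.worker.memory.spill": 0.70,
    "distributed.worker.memory.pause": 0.80,
    "distributed.worker.memory.terminate": 0.95,
})

# Set the config before creating the cluster so the workers pick it up.
cluster = LocalCluster(memory_limit="1GB")
client = Client(cluster)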

On Tue, Nov 27, 2018 at 2:35 PM phildong notifications@github.com wrote:

I wouldn’t be surprised to learn that the majority of 200MB might be filled just by importing various scientific Python libraries.

In [1]: import psutil

In [2]: psutil.Process().memory_info().rss / 1e6
Out[2]: 44.490752

In [3]: import dask.array

In [4]: psutil.Process().memory_info().rss / 1e6
Out[4]: 64.192512

In [5]: import dask.array.image

In [6]: psutil.Process().memory_info().rss / 1e6
Out[6]: 113.864704

Hah! Thanks for the fast reply! I was silly to think that I could push memory_limit to the extreme so that I could run into the leaking issue sooner with a smaller data size.

After I set memory_limit to 1GB (which I assume should be reasonable), the sum returns fine!

However, I ran into the memory leak again with the following attempt to normalize my entire array to the range (0, 1):

arr_max = array.max()
arr_min = array.min()
arr_norm = (array - arr_min) / (arr_max - arr_min)

and a subsequent call to client.persist(arr_norm) gives me killed workers and memory leak warnings in the logs, regardless of whether array was generated with dask.array.random or loaded from real tiffs (so I guess it’s no longer an imread issue per se). I can also confirm that if I skip the normalization and just call client.persist(array), the array can sit in the cluster, with part of it presumably spilled to disk. So I assume I’m doing something stupid with the normalization?
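One workaround that might be worth trying (a sketch only, not something confirmed in this thread) is to compute the two scalars eagerly first, so the graph that gets persisted contains only the elementwise arithmetic:

import dask
import dask.array as da
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(memory_limit="1GB")
client = Client(cluster)

# Stand-in for the movie array, using the larger chunks suggested above.
array = da.random.random((20000, 500, 800), chunks=(100, -1, -1))

# Compute the global min and max up front; dask.compute returns plain
# scalars, so arr_norm's graph holds only the subtraction and division
# rather than the full reduction tree.
arr_min, arr_max = dask.compute(array.min(), array.max())

arr_norm = (array - arr_min) / (arr_max - arr_min)
arr_norm = client.persist(arr_norm)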

As a side note, I also noticed that right after I start the cluster, even if I do nothing, the “Bytes stored” number in the dashboard increases very slowly, as if it were accumulating logs or something, even though I virtually turned off logging by setting “distributed: warning” in the configuration. Do you have any suggestions on how to track this down?

