Are reference cycles a performance problem?
See the original GitHub issue. @gjoseph92 noticed that, under some profiling conditions, turning off garbage collection had a significant impact on scheduler performance. I'm going to include some notes from him in the summary below.
Notes from Gabe
See https://github.com/dask/distributed/discussions/4825 for initial discussion of the problem. It also comes up on https://github.com/dask/distributed/issues/4881#issuecomment-859208156.
I’ve also run these with GC debug mode on (https://github.com/gjoseph92/dask-profiling-coiled/commit/c0ea2aa1) and looked at GC logs. Interestingly GC debug mode generally reports GC as taking zero time:
gc: done, 0 unreachable, 0 uncollectable, 0.0000s elapsed
Some of those logs are here: https://rawcdn.githack.com/gjoseph92/dask-profiling-coiled/61fc875173a5b2f9195346f2a523cb1d876c48ad/results/cython-shuffle-gc-debug-noprofiling-ecs-prod-nopyspy.txt?raw=true
The types of objects being listed as collectable are interesting (cells, frames, tracebacks, asyncio Futures/Tasks, SelectorKey) since those are the sorts of things you might expect to create cycles. It’s also interesting that there are already ~150k objects in generation 3 before the computation has even started, and ~300k (and growing) once it’s been running for a little bit.
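For reference, here is a minimal sketch (not from Gabe's notes) of how to look at the same numbers locally; the `generation` argument to `gc.get_objects()` needs Python 3.8+:

```python
import gc

# DEBUG_STATS makes the collector print lines like the "gc: done, ... elapsed"
# output quoted above; add DEBUG_COLLECTABLE to also list each collectable object.
gc.set_debug(gc.DEBUG_STATS)

# Count tracked objects per generation; the oldest generation is where the
# ~150k-300k objects mentioned above accumulate.
for gen in range(3):
    print(f"generation {gen}: {len(gc.get_objects(generation=gen))} tracked objects")

# Cumulative per-generation stats: collections run, objects collected,
# uncollectable objects found.
for gen, stats in enumerate(gc.get_stats()):
    print(f"generation {gen} stats: {stats}")

gc.set_debug(0)  # turn the debug output back off
```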
I’ve also tried:
- turning off statistical profiling
- turning off the bokeh dashboard
- using uvloop instead of native asyncio
But none of those affected the issue.
What I wanted to do next was use refcycle or objgraph or a similar tool to try to see what’s causing the cycles. Or possibly use tracemalloc + GC hooks to try to log where the objects that were being collected were initially created.
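For what it's worth, here is roughly what the tracemalloc + GC hooks idea could look like. This is an untested sketch, not something that has been run against the scheduler, and `dump_garbage_origins` is just a hypothetical helper name:

```python
import gc
import tracemalloc

# DEBUG_SAVEALL makes the collector move everything it would have freed into
# gc.garbage instead, so tracemalloc can tell us where each object was
# allocated. Debugging only: this deliberately leaks every cycle it finds.
tracemalloc.start(25)            # keep up to 25 frames of allocation traceback
gc.set_debug(gc.DEBUG_SAVEALL)

def log_collection(phase, info):
    # gc.callbacks passes ("start" | "stop", {"generation", "collected", "uncollectable"})
    if phase == "stop" and info["collected"]:
        print(f"gen {info['generation']}: {info['collected']} objects collected")

gc.callbacks.append(log_collection)

# ... run the workload, then ask where the collected objects came from:
def dump_garbage_origins(limit=20):
    for obj in gc.garbage[:limit]:
        tb = tracemalloc.get_object_traceback(obj)
        print(type(obj).__name__, "allocated at:")
        print("\n".join(tb.format(limit=3)) if tb else "  (no traceback recorded)")
```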
I notice that we have reference cycles in our scheduler state:
In [1]: from dask.distributed import Client
In [2]: client = Client()
In [3]: import dask.array as da
In [4]: x = da.random.random((1000, 1000)).sum().persist()
In [5]: s = client.cluster.scheduler
In [6]: a, b = s.tasks.values()
In [7]: a
Out[7]: <TaskState "('sum-aggregate-832c859ad539eafe39d0e7207de9f1e7',)" memory>
In [8]: b
Out[8]: <TaskState "('random_sample-sum-sum-aggregate-832c859ad539eafe39d0e7207de9f1e7',)" released>
In [9]: a in b.dependents
Out[9]: True
In [10]: b in a.dependencies
Out[10]: True
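To spell out why this matters for the collector, here is a minimal standalone sketch (illustrative only, not scheduler code): two objects that reference each other can never be freed by reference counting alone, so they sit around until the cyclic collector traverses them.

```python
import gc

class Node:
    # Stand-in for TaskState: each node keeps sets of dependents/dependencies.
    def __init__(self):
        self.dependents = set()
        self.dependencies = set()

t1, t2 = Node(), Node()
t1.dependencies.add(t2)
t2.dependents.add(t1)   # t1 and t2 now form a reference cycle

del t1, t2              # drop our references; refcounts never reach zero
print(gc.collect())     # nonzero: only the cyclic GC can reclaim the pair
```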
Should we be concerned about our use of reference cycles?
Top GitHub Comments
@dhirschfeld thanks a ton for these links; they’re very helpful. Something like https://github.com/python-trio/trio/pull/1805#issuecomment-730633206 is what we’ll want to do to debug this.
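I haven’t reproduced the exact approach from that Trio comment, but the general shape of the back-reference chasing could look something like this (a rough sketch using the third-party objgraph package; the type name "frame" is just an example target):

```python
import objgraph  # third-party: pip install objgraph

# Grab one of the suspicious object types from the GC debug logs and walk
# backwards to whatever is keeping it alive.
frames = objgraph.by_type("frame")
if frames:
    chain = objgraph.find_backref_chain(frames[-1], objgraph.is_proper_module)
    objgraph.show_chain(chain, filename="frame-backrefs.png")  # needs graphviz
```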
Also, I wouldn’t be surprised if our issue is similar to this Trio one (raising exceptions within frames that hold references to other exceptions/frames/traceback objects), with a similar solution. I’ve been suspicious of this since seeing how many of the objects being collected in the debug logs are Cells, Tracebacks, Frames, etc. Intuitively you’d read that sort of code and think “how does adding a del as the last line of a function matter; the function is about to go out of scope anyway?” But it’s a different story entirely when something (the raised exception) holds a reference to that Frame.

The progressive increase of runtime over runs could be explained by a progressive increase in the number of Python objects that gc.collect() has to traverse on every iteration; read: a leak. Do you have a measure of that? Does it correlate with runtime? If so, I suggest that you
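Coming back to the del-at-the-end-of-a-function point above, here is a tiny illustrative sketch (not code from distributed) of the frame/exception cycle being described:

```python
import gc

def leaky():
    big_local = bytearray(10**6)   # stands in for whatever the frame keeps alive
    try:
        raise ValueError("boom")
    except ValueError as exc:
        saved = exc
        # saved -> exc -> __traceback__ -> leaky's frame -> saved: a cycle.
        # A trailing `del saved` (the kind of one-line fix in the Trio PR)
        # would break it so reference counting frees everything immediately.
    return None

gc.collect()          # start from a clean slate
leaky()
print(gc.collect())   # nonzero: the frame/exception cycle had to be GC'd
```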