Are reference cycles a performance problem?
See the original GitHub issue. @gjoseph92 noticed that, under some profiling conditions, turning off garbage collection had a significant impact on scheduler performance. I'm going to include some notes from him in the summary below.
Notes from Gabe
See https://github.com/dask/distributed/discussions/4825 for initial discussion of the problem. It also comes up on https://github.com/dask/distributed/issues/4881#issuecomment-859208156.
I’ve also run these with GC debug mode on (https://github.com/gjoseph92/dask-profiling-coiled/commit/c0ea2aa1) and looked at GC logs. Interestingly GC debug mode generally reports GC as taking zero time:
gc: done, 0 unreachable, 0 uncollectable, 0.0000s elapsed
Some of those logs are here: https://rawcdn.githack.com/gjoseph92/dask-profiling-coiled/61fc875173a5b2f9195346f2a523cb1d876c48ad/results/cython-shuffle-gc-debug-noprofiling-ecs-prod-nopyspy.txt?raw=true
The types of objects being listed as collectable are interesting (cells, frames, tracebacks, asyncio Futures/Tasks, SelectorKey) since those are the sorts of things you might expect to create cycles. It’s also interesting that there are already ~150k objects in generation 3 before the computation has even started, and ~300k (and growing) once it’s been running for a little bit.
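For reference, here is a minimal sketch (not from Gabe's notes) of how to look at the same numbers locally; the `generation` argument to `gc.get_objects()` needs Python 3.8+:

```python
import gc

# DEBUG_STATS makes the collector print lines like the "gc: done, ... elapsed"
# output quoted above; add DEBUG_COLLECTABLE to also list each collectable object.
gc.set_debug(gc.DEBUG_STATS)

# Count tracked objects per generation; the oldest generation is where the
# ~150k-300k objects mentioned above accumulate.
for gen in range(3):
    print(f"generation {gen}: {len(gc.get_objects(generation=gen))} tracked objects")

# Cumulative per-generation stats: collections run, objects collected,
# uncollectable objects found.
for gen, stats in enumerate(gc.get_stats()):
    print(f"generation {gen} stats: {stats}")

gc.set_debug(0)  # turn the debug output back off
```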
I’ve also tried:
- turning off statistical profiling
- turning off the bokeh dashboard
- using uvloop instead of native asyncio
But none of those affected the issue.
What I wanted to do next was use refcycle or objgraph or a similar tool to try to see what’s causing the cycles. Or possibly use tracemalloc + GC hooks to try to log where the objects that were being collected were initially created.
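For what it's worth, here is roughly what the tracemalloc + GC hooks idea could look like. This is an untested sketch, not something that has been run against the scheduler, and `dump_garbage_origins` is just a hypothetical helper name:

```python
import gc
import tracemalloc

# DEBUG_SAVEALL makes the collector move everything it would have freed into
# gc.garbage instead, so tracemalloc can tell us where each object was
# allocated. Debugging only: this deliberately leaks every cycle it finds.
tracemalloc.start(25)            # keep up to 25 frames of allocation traceback
gc.set_debug(gc.DEBUG_SAVEALL)

def log_collection(phase, info):
    # gc.callbacks passes ("start" | "stop", {"generation", "collected", "uncollectable"})
    if phase == "stop" and info["collected"]:
        print(f"gen {info['generation']}: {info['collected']} objects collected")

gc.callbacks.append(log_collection)

# ... run the workload, then ask where the collected objects came from:
def dump_garbage_origins(limit=20):
    for obj in gc.garbage[:limit]:
        tb = tracemalloc.get_object_traceback(obj)
        print(type(obj).__name__, "allocated at:")
        print("\n".join(tb.format(limit=3)) if tb else "  (no traceback recorded)")
```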
I notice that we have reference cycles in our scheduler state:
In [1]: from dask.distributed import Client
In [2]: client = Client()
In [3]: import dask.array as da
In [4]: x = da.random.random((1000, 1000)).sum().persist()
In [5]: s = client.cluster.scheduler
In [6]: a, b = s.tasks.values()
In [7]: a
Out[7]: <TaskState "('sum-aggregate-832c859ad539eafe39d0e7207de9f1e7',)" memory>
In [8]: b
Out[8]: <TaskState "('random_sample-sum-sum-aggregate-832c859ad539eafe39d0e7207de9f1e7',)" released>
In [9]: a in b.dependents
Out[9]: True
In [10]: b in a.dependencies
Out[10]: True
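To spell out why this matters for the collector, here is a minimal standalone sketch (illustrative only, not scheduler code): two objects that reference each other can never be freed by reference counting alone, so they sit around until the cyclic collector traverses them.

```python
import gc

class Node:
    # Stand-in for TaskState: each node keeps sets of dependents/dependencies.
    def __init__(self):
        self.dependents = set()
        self.dependencies = set()

t1, t2 = Node(), Node()
t1.dependencies.add(t2)
t2.dependents.add(t1)   # t1 and t2 now form a reference cycle

del t1, t2              # drop our references; refcounts never reach zero
print(gc.collect())     # nonzero: only the cyclic GC can reclaim the pair
```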
Should we be concerned about our use of reference cycles?
Top GitHub Comments
@dhirschfeld thanks a ton for these links; they’re very helpful. Something like https://github.com/python-trio/trio/pull/1805#issuecomment-730633206 is what we’ll want to do to debug this.
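I haven’t reproduced the exact approach from that Trio comment, but the general shape of the back-reference chasing could look something like this (a rough sketch using the third-party objgraph package; the type name "frame" is just an example target):

```python
import objgraph  # third-party: pip install objgraph

# Grab one of the suspicious object types from the GC debug logs and walk
# backwards to whatever is keeping it alive.
frames = objgraph.by_type("frame")
if frames:
    chain = objgraph.find_backref_chain(frames[-1], objgraph.is_proper_module)
    objgraph.show_chain(chain, filename="frame-backrefs.png")  # needs graphviz
```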
Also, I wouldn’t be surprised if our issue is similar to this Trio one (raising exceptions within frames that hold references to other exceptions/frames/traceback objects), with a similar solution. I’ve been suspicious of this since seeing how many of the objects being collected in the debug logs are Cells, Tracebacks, Frames, etc. Intuitively you’d read that sort of code and think “how does adding a del as the last line of a function matter; the function is about to go out of scope anyway?” But it’s a different story entirely when something (the raised exception) holds a reference to that Frame.

The progressive increase of runtime over runs could be explained by a progressive increase in the number of Python objects that gc.collect() has to traverse on every iteration; read: a leak. Do you have a measure of that? Does it correlate with runtime? If so, I suggest that you
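Coming back to the del-at-the-end-of-a-function point above, here is a tiny illustrative sketch (not code from distributed) of the frame/exception cycle being described:

```python
import gc

def leaky():
    big_local = bytearray(10**6)   # stands in for whatever the frame keeps alive
    try:
        raise ValueError("boom")
    except ValueError as exc:
        saved = exc
        # saved -> exc -> __traceback__ -> leaky's frame -> saved: a cycle.
        # A trailing `del saved` (the kind of one-line fix in the Trio PR)
        # would break it so reference counting frees everything immediately.
    return None

gc.collect()          # start from a clean slate
leaky()
print(gc.collect())   # nonzero: the frame/exception cycle had to be GC'd
```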