
Are reference cycles a performance problem?

See original GitHub issue

@gjoseph92 noticed that, under some profiling conditions, turning off garbage collection had a significant impact on scheduler performance. I’m going to include some notes from him in the summary below.
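For context, here is a minimal sketch of that kind of comparison; the workload below is a hypothetical stand-in, not the actual scheduler benchmark:

import gc
import time

def timed(fn):
    # Return how long fn() takes, in seconds.
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def workload():
    # Hypothetical stand-in for the profiled scheduler workload.
    return sum(len(str(i)) for i in range(500_000))

with_gc = timed(workload)

gc.disable()                     # switch the cyclic collector off...
try:
    without_gc = timed(workload)
finally:
    gc.enable()                  # ...and always switch it back on

print(f"GC enabled: {with_gc:.3f}s, GC disabled: {without_gc:.3f}s")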

Notes from Gabe

See https://github.com/dask/distributed/discussions/4825 for initial discussion of the problem. It also comes up on https://github.com/dask/distributed/issues/4881#issuecomment-859208156.

I’ve also run these with GC debug mode on (https://github.com/gjoseph92/dask-profiling-coiled/commit/c0ea2aa1) and looked at the GC logs. Interestingly, GC debug mode generally reports GC as taking zero time:

gc: done, 0 unreachable, 0 uncollectable, 0.0000s elapsed

Some of those logs are here: https://rawcdn.githack.com/gjoseph92/dask-profiling-coiled/61fc875173a5b2f9195346f2a523cb1d876c48ad/results/cython-shuffle-gc-debug-noprofiling-ecs-prod-nopyspy.txt?raw=true
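For reference, that style of GC debug logging can be switched on from the standard library alone; a minimal sketch (the linked commit may do more than this):

import gc

# DEBUG_STATS prints the "gc: done, ... elapsed" summary lines;
# DEBUG_COLLECTABLE additionally lists the objects found to be collectable.
gc.set_debug(gc.DEBUG_STATS | gc.DEBUG_COLLECTABLE)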

The types of objects being listed as collectable are interesting (cells, frames, tracebacks, asyncio Futures/Tasks, SelectorKey) since those are the sorts of things you might expect to create cycles. It’s also interesting that there are already ~150k objects in generation 3 before the computation has even started, and ~300k (and growing) once it’s been running for a little bit.
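One way to double-check those numbers interactively (note that the gc module numbers generations 0-2, so the oldest generation is generation=2, and gc.get_objects(generation=...) needs Python 3.8+):

import collections
import gc

# Objects that have survived into the oldest generation.
oldest = gc.get_objects(generation=2)
print(len(oldest), "objects in the oldest generation")

# Tally the types living there (cells, frames, tracebacks, Futures, ...).
print(collections.Counter(type(o).__name__ for o in oldest).most_common(10))

# gc.freeze() (Python 3.7+) would move everything allocated so far into a
# permanent generation that later collections skip entirely.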

I’ve also tried:

  • turning off statistical profiling
  • turning off the bokeh dashboard
  • using uvloop instead of native asyncio

But none of those affected the issue.

What I wanted to do next was use refcycle or objgraph or a similar tool to try to see what’s causing the cycles. Or possibly use tracemalloc + GC hooks to try to log where the objects that were being collected were initially created.
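A rough sketch of the tracemalloc idea, using gc.DEBUG_SAVEALL instead of GC callbacks so that collected objects stay inspectable in gc.garbage:

import gc
import tracemalloc

tracemalloc.start(25)             # keep up to 25 frames of allocation traceback
gc.set_debug(gc.DEBUG_SAVEALL)    # collectable objects land in gc.garbage instead of being freed

# ... run the workload of interest here ...

gc.collect()
for obj in gc.garbage[:20]:
    print(type(obj).__name__)
    tb = tracemalloc.get_object_traceback(obj)
    if tb is not None:
        print("\n".join(tb.format()))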

I notice that we have reference cycles in our scheduler state:

In [1]: from dask.distributed import Client
In [2]: client = Client()
In [3]: import dask.array as da
In [4]: x = da.random.random((1000, 1000)).sum().persist()
In [5]: s = client.cluster.scheduler
In [6]: a, b = s.tasks.values()

In [7]: a
Out[7]: <TaskState "('sum-aggregate-832c859ad539eafe39d0e7207de9f1e7',)" memory>

In [8]: b
Out[8]: <TaskState "('random_sample-sum-sum-aggregate-832c859ad539eafe39d0e7207de9f1e7',)" released>

In [9]: a in b.dependents
Out[9]: True
In [10]: b in a.dependencies
Out[10]: True
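Concretely, that dependents/dependencies structure is a textbook cycle: reference counting alone can never free it, and only the cyclic collector can. A toy illustration (not the real TaskState class):

import gc
import weakref

class Node:
    def __init__(self):
        self.dependents = set()
        self.dependencies = set()

a, b = Node(), Node()
a.dependencies.add(b)
b.dependents.add(a)

probe = weakref.ref(a)
del a, b                     # refcounts never reach zero: the cycle keeps both alive
print(probe() is not None)   # True -- still alive, waiting for the cyclic collector
gc.collect()
print(probe() is None)       # True -- only gc.collect() reclaimed them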

Should we be concerned about our use of reference cycles?

cc @jakirkham @pitrou

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 25 (17 by maintainers)

Top GitHub Comments

2 reactions
gjoseph92 commented, Jul 2, 2021

@dhirschfeld thanks a ton for these links; they’re very helpful. Something like https://github.com/python-trio/trio/pull/1805#issuecomment-730633206 is what we’ll want to do to debug this.

Also, I wouldn’t be surprised if our issue is similar to this Trio one (raising exceptions within frames that hold references to other exceptions/frames/traceback objects), with a similar solution. I’ve been suspicious of this since seeing how so many of the objects being collected in the debug logs are Cells, Tracebacks, Frames, etc. Intuitively you’d read that sort of code and think “how does adding a del as the last line of a function matter; the function is about to go out of scope anyway.” But it’s a different story entirely when something (the raised exception) holds a reference to that Frame.
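To make that concrete, here is a toy version of that pattern (hypothetical names; the fix from the Trio thread corresponds to the del noted in the comment):

import gc
import weakref

class Sentinel:
    pass

def handler():
    big = Sentinel()              # stand-in for something heavy held in a local
    try:
        raise ValueError("boom")
    except ValueError as err:
        saved = err               # the exception survives in a local of this frame
    # saved.__traceback__ -> this frame -> its locals -> saved: a reference cycle.
    # The Trio-style fix is `del saved` right here, before returning.
    return weakref.ref(big)

probe = handler()
print(probe() is not None)   # True -- the cycle keeps the dead frame (and big) alive
gc.collect()
print(probe() is None)       # True -- only the cyclic collector could free them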

1 reaction
crusaderky commented, Jul 16, 2021

The progressive increase in runtime over runs could be explained by a progressive increase in the number of Python objects that gc.collect() has to go through at every iteration - read: a leak. Do you have a measure of that? Does it correlate with runtime? If so, I suggest that you (a rough sketch follows the list):

  1. do one run (to make sure everything’s warm)
  2. run pympler to log the ids of all objects in the interpreter
  3. run many more runs and observe the progressive slowdown
  4. run pympler again to list all objects in the interpreter, then remove those that were already present in (2)
  5. use pympler.refbrowser (warning: mortally slow) to find out what’s referencing them and why they aren’t being garbage collected
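A rough sketch of steps 2-5 (the workload function is a hypothetical placeholder, and the pympler calls should be double-checked against its docs):

from pympler import muppy, refbrowser

def run_workload():
    # Hypothetical placeholder for the repeated runs that show the slowdown.
    return [object() for _ in range(1000)]

baseline = {id(o) for o in muppy.get_objects()}       # step 2: ids of everything already alive

kept = run_workload()                                  # step 3: many more runs would go here;
                                                       # `kept` keeps the results alive

new_objects = [o for o in muppy.get_objects()          # step 4: objects not present at baseline
               if id(o) not in baseline]
print(len(new_objects), "objects appeared since the baseline")

if new_objects:
    # step 5: walk the referrers of one suspect (warning: very slow on a big heap)
    refbrowser.ConsoleBrowser(new_objects[0], maxdepth=3).print_tree()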

