Potential race condition in Nanny
Hi everybody, for the past few days we’ve been seeing “random” failures in our CI, with distributed emitting:
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 1>>
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
return self.callback()
File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/distributed/nanny.py", line 414, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
This feels like a race condition that occurs in some situations, e.g. while the Nanny is closing: the periodic callbacks are still running but Nanny.process is already None.
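To illustrate the suspected race, here is a minimal sketch in plain Python. FakeNanny and FakeProcess are hypothetical stand-ins, not distributed's real classes, and the guard shown is only one possible way to make such a callback tolerate a closed Nanny; it is not necessarily how distributed handles it.

```python
class FakeProcess:
    """Hypothetical wrapper that holds a handle to the child process."""

    def __init__(self, pid):
        self.process = f"<child process pid={pid}>"


class FakeNanny:
    """Hypothetical stand-in for a Nanny with a periodic memory monitor."""

    def __init__(self):
        self.process = FakeProcess(pid=1234)

    def close(self):
        # Closing drops the process wrapper, but a periodic callback
        # that was already scheduled may still fire afterwards.
        self.process = None

    def memory_monitor_unguarded(self):
        # Mirrors the failing line in the traceback: this raises
        # AttributeError once self.process has been set to None.
        return self.process.process

    def memory_monitor_guarded(self):
        # Read the attribute once, then bail out if it is already gone.
        wrapper = self.process
        if wrapper is None:
            return None
        return wrapper.process


nanny = FakeNanny()
nanny.close()  # simulate the Nanny shutting down first

try:
    nanny.memory_monitor_unguarded()  # the late callback fires
except AttributeError as exc:
    print("unguarded callback failed:", exc)

print("guarded callback returned:", nanny.memory_monitor_guarded())
```

In a real event loop the other half of the fix would be stopping the periodic callbacks before clearing the attribute; the guard above only makes a late callback harmless.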
I think (still investigating) we’ve started seeing this only after 2.20 was released.
I cannot attach an MFE yet, simply because we don’t have one 😬; we only experience this on CI. I’d welcome any sort of feedback. I can try to give some context though: the error is triggered in a Jupyter Notebook by a cell which calls scipy.optimize.minimize from a Dask worker (cell 14 here – look for optimize.minimize). I doubt it’s ever going to be useful, but here’s an excerpt of the raw log (part of which I pasted above). Note that it repeats over and over again for hundreds or thousands of lines…
Thanks!
Issue Analytics
- Created 3 years ago
- Reactions:1
- Comments:28 (11 by maintainers)
Top GitHub Comments
@fjetter thanks a bunch for the help. It looks like the conda environment displayed in the upper right of the notebook wasn’t actually active for some reason. I ran the following in ipython instead, with the environment actually active:
and was able to get a more informative traceback with the main branch installed.
Using the correct arg, n_workers, solved my problem.
I just ran into this issue on my local machine, so it is alive and well.
This results in the following being printed over and over again:
Known Causes:
- When os.getcwd() is a directory without write privileges, the error is triggered every time.
Version Information
- help(distributed) gives version 2.30.0
- pip3 install dask[complete] as root
- free -h:
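As a quick check for the known cause listed above (a working directory without write privileges), the following sketch tests whether a directory is writable before any workers are started. The helper name directory_is_writable is hypothetical and uses only the standard library; it is not part of dask or distributed.

```python
import os
import tempfile


def directory_is_writable(path=None):
    """Return True if `path` (default: the current working directory)
    is writable by the current user, according to os.access."""
    directory = os.getcwd() if path is None else path
    return os.access(directory, os.W_OK)


# Check the directory the process was started from.
print("cwd:", os.getcwd(), "writable:", directory_is_writable())

# For comparison, check the system temporary directory.
print("tmp:", tempfile.gettempdir(), "writable:",
      directory_is_writable(tempfile.gettempdir()))
```

If the first check reports False, starting the cluster from a writable directory may avoid the repeated error. Note that os.access generally reports True for root, so this check is most useful for unprivileged users.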