Deadlocks happening with version `1.4.0`
After upgrading clearml to 1.4.0, our code started producing deadlocks that had not been happening before. The part of our code that was caught in a deadlock was inside `multiprocessing.Pool()`.
We use Python 3.8 and TensorFlow 2.8. We use ClearML for experiment tracking only: we create `Task` objects manually and report scalars using `clearml.Logger.report_scalar()`.
Disabling clearml solved the issue. We pinpointed https://github.com/allegroai/clearml/commit/7625de3f2fec0eb641024ce7ca70a7d31083fa23 as the culprit; downgrading to the commit before it solved the problem.
We noticed that you patched the OS fork mechanism 👇, which may cause weird issues like ours. https://github.com/allegroai/clearml/blob/bca9a6de3095f411ae5b766d00967535a13e8401/clearml/binding/environ_bind.py#L61
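The classic hazard when mixing `fork()` with background threads is that a lock held by another thread at fork time is copied into the child in the locked state, while the thread that would release it is not copied, so the child blocks forever. A minimal stdlib-only sketch of that mechanism (an illustration of the general failure mode, not ClearML code; POSIX-only):

```python
import os
import threading
import time

lock = threading.Lock()

def holder():
    # Background thread holds the lock while the main thread forks.
    with lock:
        time.sleep(0.5)

threading.Thread(target=holder, daemon=True).start()
time.sleep(0.1)  # give the background thread time to acquire the lock

pid = os.fork()
if pid == 0:
    # Child: the holder thread was not copied by fork(), but the lock was
    # copied in the locked state -- calling lock.acquire() here would
    # block forever, i.e. deadlock.
    os._exit(0 if lock.locked() else 1)

_, status = os.waitpid(pid, 0)
child_saw_locked_lock = (os.WEXITSTATUS(status) == 0)
```

Any lock inside a patched fork hook, or inside a library's background reporting machinery, can trigger this pattern.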
After commenting out this line, https://github.com/allegroai/clearml/blob/bca9a6de3095f411ae5b766d00967535a13e8401/clearml/task.py#L628-L629, the deadlock did not happen anymore. Disclaimer: we did not test this thoroughly; the purpose was only to get a quick feeling for whether this hypothesis has merit.
You probably have reasons for patching the OS forking mechanism; however, from reading the comments in that part of the code, it is not entirely clear what root cause you're solving.
Issue Analytics
- State:
- Created a year ago
- Comments: 10 (4 by maintainers)
Quick update, v1.4.1 is out with a fix 😄
`pip install clearml==1.4.1`
@vlad-ivanov-name
Actually, it is there to support using the `Task` object from the forked process. Python is not very good with subprocesses: it is basically on the user to do any bookkeeping for background threads running on the main process, which will *not* be replicated into the forked process. This means, for example, that the background reporting thread needs to be re-created in the forked process. These kinds of things could be solved with lazy loading, only when a new report is generated (i.e. console reporting or event reporting); see below for more on that.

Actually, this is not really needed, because the same "hook" on the main process will be replicated into the subprocess; the missing part is actually sending the data to the server (which is always done in a background thread/process, see the answer above).
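The bookkeeping point can be demonstrated with the stdlib alone: after `fork()`, only the calling thread exists in the child, so any background reporter thread silently disappears. A small illustrative sketch (not ClearML code; POSIX-only):

```python
import os
import threading
import time

def reporter():
    # Stand-in for a background reporting thread on the main process.
    while True:
        time.sleep(0.1)

threading.Thread(target=reporter, daemon=True).start()

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: fork() replicates only the forking thread, so the reporter
    # thread is gone here and would have to be re-created.
    os.write(w, str(threading.active_count()).encode())
    os._exit(0)

os.waitpid(pid, 0)
threads_in_child = int(os.read(r, 16).decode())   # 1: main thread only
threads_in_parent = threading.active_count()       # 2: main + reporter
```

This is why a library that reports from a background thread has to do *something* at fork time, whether by patching the fork call or by lazily re-creating the thread.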
I'm hoping that we will be able to remove the need for patching the fork call. We are now testing a new "deferred Task init" call that basically does the initial handshaking in the background; a by-product of that is that we will be able to "lazy load" the missing background reporting thread (see above) in the forked process, without the need to patch the fork call.
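The "lazy load on the forked process" idea can be sketched with `os.register_at_fork()` (Python 3.7+), the stdlib alternative to patching `os.fork` itself. The class below is a hypothetical illustration of that pattern, not ClearML's actual implementation:

```python
import os
import queue
import threading

class LazyReporter:
    """Hypothetical reporter that re-creates its background thread lazily
    after a fork, instead of patching the fork call."""

    def __init__(self):
        self._events = queue.Queue()
        self._thread = None
        # Stdlib hook: run a callback in every child created by os.fork().
        os.register_at_fork(after_in_child=self._forget_thread)

    def _forget_thread(self):
        # The background thread did not survive the fork; drop the stale handle
        # so the next report re-creates it in this process.
        self._thread = None

    def _drain(self):
        while True:
            self._events.get()  # a real reporter would ship this to a server

    def report_scalar(self, value):
        # Lazy load: (re-)create the thread on first report in this process.
        if self._thread is None or not self._thread.is_alive():
            self._thread = threading.Thread(target=self._drain, daemon=True)
            self._thread.start()
        self._events.put(value)

reporter = LazyReporter()
reporter.report_scalar(0.93)  # background thread is created on first use
```

Because the thread is only created on demand, nothing needs to run inside the fork itself, which sidesteps the lock-across-fork hazard entirely.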
I hope this helps. This is a complicated flow with lots of ins and outs; I'm hoping this makes it at least a bit clearer (pun intended 😃).
Hi @alekpikl,
You’re absolutely right - our apologies 🙏 🙁 We do have an internal test that should have caught it… I’ll check why our automated testing pipeline failed to detect that.
We'll release an RC as soon as possible and will update here.