Deadlocks happening with version `1.4.0`
After upgrading clearml to 1.4.0, our code started producing deadlocks that had not been happening before. The part of our code that was caught in a deadlock was inside `multiprocessing.Pool()`.
We use Python 3.8 and TensorFlow 2.8. We use ClearML for experiment tracking only: we create `Task` objects manually and report scalars using `clearml.Logger.report_scalar()`.
Disabling clearml solved the issue. We pinpointed https://github.com/allegroai/clearml/commit/7625de3f2fec0eb641024ce7ca70a7d31083fa23 as the culprit; downgrading to the commit before it solved the problem.
We noticed that you patched the OS fork mechanism 👇, which may cause weird issues like ours. https://github.com/allegroai/clearml/blob/bca9a6de3095f411ae5b766d00967535a13e8401/clearml/binding/environ_bind.py#L61
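The classic hazard when mixing `fork()` with background threads is that a lock held by another thread at fork time is copied into the child in the locked state, while the thread that would release it is not copied, so the child blocks forever. A minimal stdlib-only sketch of that mechanism (an illustration of the general failure mode, not ClearML code; POSIX-only):

```python
import os
import threading
import time

lock = threading.Lock()

def holder():
    # Background thread holds the lock while the main thread forks.
    with lock:
        time.sleep(0.5)

threading.Thread(target=holder, daemon=True).start()
time.sleep(0.1)  # give the background thread time to acquire the lock

pid = os.fork()
if pid == 0:
    # Child: the holder thread was not copied by fork(), but the lock was
    # copied in the locked state -- calling lock.acquire() here would
    # block forever, i.e. deadlock.
    os._exit(0 if lock.locked() else 1)

_, status = os.waitpid(pid, 0)
child_saw_locked_lock = (os.WEXITSTATUS(status) == 0)
```

Any lock inside a patched fork hook, or inside a library's background reporting machinery, can trigger this pattern.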
After commenting out this line, https://github.com/allegroai/clearml/blob/bca9a6de3095f411ae5b766d00967535a13e8401/clearml/task.py#L628-L629, the deadlock did not happen anymore. Disclaimer: we did not test this thoroughly; the purpose was only to get a quick feeling for whether this hypothesis has merit.
You probably have reasons for patching the OS forking mechanism; however, from reading the comments in that part of the code, it is not entirely clear what root cause you're solving.
Issue Analytics
- State:
- Created a year ago
- Comments: 10 (4 by maintainers)
Quick update, v1.4.1 is out with a fix 😄
`pip install clearml==1.4.1`
@vlad-ivanov-name
Actually, it is there to support using the `Task` object from the forked process. Python is not very good with subprocesses: it is basically on the user to do any bookkeeping for background threads running on the main process, which will *not* be replicated into the forked process. This means, for example, that the background reporting thread needs to be re-created in the forked process. These kinds of things could be solved with lazy loading, only when a new report is generated (i.e. console reporting or event reporting); see below for more on that.

Actually, this is not really needed, because the same "hook" on the main process will be replicated into the subprocess; the missing part is actually sending the data to the server (which is always done in a background thread/process, see the answer above).
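The bookkeeping point can be demonstrated with the stdlib alone: after `fork()`, only the calling thread exists in the child, so any background reporter thread silently disappears. A small illustrative sketch (not ClearML code; POSIX-only):

```python
import os
import threading
import time

def reporter():
    # Stand-in for a background reporting thread on the main process.
    while True:
        time.sleep(0.1)

threading.Thread(target=reporter, daemon=True).start()

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: fork() replicates only the forking thread, so the reporter
    # thread is gone here and would have to be re-created.
    os.write(w, str(threading.active_count()).encode())
    os._exit(0)

os.waitpid(pid, 0)
threads_in_child = int(os.read(r, 16).decode())   # 1: main thread only
threads_in_parent = threading.active_count()       # 2: main + reporter
```

This is why a library that reports from a background thread has to do *something* at fork time, whether by patching the fork call or by lazily re-creating the thread.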
I'm hoping that we will be able to remove the need for patching the fork call. We are now testing a new "deferred Task init" call that basically does the initial handshaking in the background; a by-product of that is that we will be able to "lazy load" the missing background reporting thread (see above) in the forked process, without the need to patch the fork call.
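The "lazy load on the forked process" idea can be sketched with `os.register_at_fork()` (Python 3.7+), the stdlib alternative to patching `os.fork` itself. The class below is a hypothetical illustration of that pattern, not ClearML's actual implementation:

```python
import os
import queue
import threading

class LazyReporter:
    """Hypothetical reporter that re-creates its background thread lazily
    after a fork, instead of patching the fork call."""

    def __init__(self):
        self._events = queue.Queue()
        self._thread = None
        # Stdlib hook: run a callback in every child created by os.fork().
        os.register_at_fork(after_in_child=self._forget_thread)

    def _forget_thread(self):
        # The background thread did not survive the fork; drop the stale handle
        # so the next report re-creates it in this process.
        self._thread = None

    def _drain(self):
        while True:
            self._events.get()  # a real reporter would ship this to a server

    def report_scalar(self, value):
        # Lazy load: (re-)create the thread on first report in this process.
        if self._thread is None or not self._thread.is_alive():
            self._thread = threading.Thread(target=self._drain, daemon=True)
            self._thread.start()
        self._events.put(value)

reporter = LazyReporter()
reporter.report_scalar(0.93)  # background thread is created on first use
```

Because the thread is only created on demand, nothing needs to run inside the fork itself, which sidesteps the lock-across-fork hazard entirely.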
I hope this helps. This is a complicated flow with lots of ins and outs; I'm hoping this makes it at least a bit clearer (pun intended 😃).
Hi @alekpikl,
You’re absolutely right - our apologies 🙏 🙁 We do have an internal test that should have caught it… I’ll check why our automated testing pipeline failed to detect that.
We'll release an RC as soon as possible and will update here.