Deadlocks happening with version `1.4.0`

After upgrading clearml to 1.4.0, our code started producing deadlocks that had not been happening before.

The part of our code that was caught in a deadlock was inside multiprocessing.Pool().

We used Python 3.8 and TensorFlow 2.8. We use ClearML for experiment tracking only. We create Task objects manually and report scalars using clearml.Logger.report_scalar().

Disabling clearml solved the issue. We pinpointed https://github.com/allegroai/clearml/commit/7625de3f2fec0eb641024ce7ca70a7d31083fa23 as the culprit. Downgrading to the commit before solved the problem.

We noticed that you patched the OS forking 👇 , which may cause weird issues like ours. https://github.com/allegroai/clearml/blob/bca9a6de3095f411ae5b766d00967535a13e8401/clearml/binding/environ_bind.py#L61

After commenting out this line, https://github.com/allegroai/clearml/blob/bca9a6de3095f411ae5b766d00967535a13e8401/clearml/task.py#L628-L629 the deadlock did not happen anymore. Disclaimer, we did not test this. The purpose was to get a quick feeling if this hypothesis has merit.

You probably have reasons for patching the OS forking mechanism; however, from reading the comments in that part of the code, it is not entirely clear what root cause you’re solving.
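For readers wondering how forking can hang at all, here is a minimal standard-library sketch of one general way threads and `fork` interact badly. It is not confirmed to be what happens inside ClearML; the names `shared_lock`, `hold_lock` and `child_can_acquire` are made up for the illustration. A lock held by another thread at fork time stays locked in the child, where no thread exists to release it. The sketch uses a timeout so it reports the problem instead of hanging (POSIX only).

```python
import os
import threading

# Hypothetical names, for illustration only; nothing here is ClearML code.
shared_lock = threading.Lock()
lock_taken = threading.Event()
release = threading.Event()


def hold_lock():
    # A background thread that owns the lock while the parent forks.
    with shared_lock:
        lock_taken.set()
        release.wait()


def child_can_acquire(timeout=0.2):
    """Fork and report whether the child managed to take the lock."""
    pid = os.fork()
    if pid == 0:
        # Child: the lock was copied in its locked state, and the thread
        # that owns it does not exist in this process.
        got_it = shared_lock.acquire(timeout=timeout)
        os._exit(0 if got_it else 1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


holder = threading.Thread(target=hold_lock, daemon=True)
holder.start()
lock_taken.wait()  # make sure the lock is held before forking

acquired_in_child = child_can_acquire()

release.set()
holder.join()

print("child acquired the lock:", acquired_in_child)
```

Without the timeout, the child would block forever on `acquire()`, which from the outside looks like a worker stuck inside a process pool.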

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

2 reactions
bmartinn commented, May 17, 2022

Quick update: v1.4.1 is out with a fix 😄 `pip install clearml==1.4.1`

@vlad-ivanov-name

So would it be correct to say this patch exists to capture stdout/stderr of forked processes?

Actually, it is there to support using the Task object from the forked process. Python is not very good with subprocesses: it is basically on the user to do the bookkeeping for background threads running in the main process, because they are not replicated into the forked process. This means that, for example, the background reporting thread needs to be re-created in the forked process. These kinds of things could be solved with lazy loading, done only when a new report is generated (i.e. console reporting or event reporting); see below for more on that.
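The point about background threads not being replicated can be checked with the standard library alone. This is a small sketch, not ClearML code: `background_reporter` is a hypothetical stand-in for a reporting thread, and the example assumes a POSIX system where `os.fork` is available.

```python
import os
import threading


def background_reporter(stop_event):
    # Stand-in for a reporting thread; it idles until told to stop.
    while not stop_event.wait(0.05):
        pass


def count_threads_in_forked_child():
    """Fork and return how many threads are alive in the child process."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork() exists here.
        os.close(read_fd)
        os.write(write_fd, str(threading.active_count()).encode())
        os.close(write_fd)
        os._exit(0)
    # Parent: read the child's answer, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        child_count = int(reader.read())
    os.waitpid(pid, 0)
    return child_count


stop = threading.Event()
worker = threading.Thread(target=background_reporter, args=(stop,), daemon=True)
worker.start()

parent_threads = threading.active_count()
child_threads = count_threads_in_forked_child()

stop.set()
worker.join()

print("threads in parent:", parent_threads)
print("threads in forked child:", child_threads)
```

The child reports a single thread even though the parent had the worker running, so anything the worker was responsible for has to be set up again in the child.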

wonder if it could be achieved by creating

Actually, this is not really needed, because the same “hook” on the main process will be replicated into the subprocess. The missing part is actually sending the data to the server, which is always done in a background thread/process (see the answer above).

I’m hoping that we will be able to remove the need for patching the fork call. We are now testing a new “deferred Task init” call that basically does the initial handshaking in the background. The by-product of that is that we will be able to “lazy load” the missing background reporting thread (see above) in the forked process, without the need to patch the fork call.
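To make the lazy-loading idea concrete, here is a rough standard-library sketch. It is not how ClearML implements it, and `LazyReporter` is a hypothetical class: the worker thread is started on the first `report()` call in each process, detected by comparing `os.getpid()` with the PID that created the current worker. For brevity it assumes `report()` is called from one thread per process, and that a POSIX `os.fork` is available.

```python
import os
import queue
import threading


class LazyReporter:
    """Hypothetical reporter that starts its worker on first use per process."""

    def __init__(self):
        self._owner_pid = None
        self._queue = None
        self._thread = None
        self.sent = []  # stand-in for "data sent to the server"

    def _ensure_worker(self):
        # A worker created in this very process is still valid.
        if self._thread is not None and self._owner_pid == os.getpid():
            return
        # First use, or we are in a forked child where the parent's
        # thread does not exist: build fresh state for this process.
        self._queue = queue.Queue()
        self.sent = []
        self._thread = threading.Thread(
            target=self._drain, args=(self._queue, self.sent), daemon=True
        )
        self._thread.start()
        self._owner_pid = os.getpid()

    @staticmethod
    def _drain(q, sent):
        while True:
            item = q.get()
            if item is None:  # sentinel: stop the worker
                break
            sent.append(item)

    def report(self, title, value):
        self._ensure_worker()
        self._queue.put((title, value))

    def flush(self):
        if self._thread is None or self._owner_pid != os.getpid():
            return
        self._queue.put(None)
        self._thread.join()
        self._thread = None


def report_from_forked_child(reporter):
    """Fork, report once from the child, and return what the child sent."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_fd)
            reporter.report("loss", 0.5)
            reporter.flush()
            os.write(write_fd, repr(reporter.sent).encode())
            os.close(write_fd)
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        child_sent = reader.read()
    os.waitpid(pid, 0)
    return child_sent


reporter = LazyReporter()
reporter.report("loss", 1.0)  # starts the worker in the parent
child_sent = report_from_forked_child(reporter)
reporter.flush()
parent_sent = list(reporter.sent)

print("parent sent:", parent_sent)
print("child sent:", child_sent)
```

The child never touches the parent's queue or thread. It notices the PID mismatch on its first report and builds its own worker, which is why no hook on the fork call itself is needed in this sketch.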

I hope this helps. This is a complicated flow with lots of ins and outs; I’m hoping this makes it at least a bit clearer (pun intended 😃).

1 reaction
jkhenning commented, May 11, 2022

Hi @alekpikl,

You’re absolutely right - our apologies 🙏 🙁 We do have an internal test that should have caught it… I’ll check why our automated testing pipeline failed to detect that.

We’ll release an RC as soon as possible and will update here.
