IO thread loops infinitely on out event and blocks application with 100% CPU usage [was renamed]
UPDATE: All assumptions in this comment turned out to be wrong. The true problem is described below in comment https://github.com/zeromq/jeromq/issues/520#issuecomment-364570301
I’m using the latest version 0.4.3 of JeroMQ and my application stops after some time because messages are not sent over a PUSH socket while one of two JeroMQ I/O threads is at 100% CPU usage. The thread remains inside epollWait, see http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/sun/nio/ch/EPollArrayWrapper.java#269
I’ve seen issue https://github.com/zeromq/jeromq/issues/506 and think it’s very likely related. I don’t know the defined behavior of epollWait or how it’s supposed to be used in JeroMQ, but I see in my application that the timeout argument is always zero whenever it hangs. As I thought it should be some positive timeout value (*), I replaced `long timeout = executeTimers();` with `long timeout = Math.max(10, executeTimers());`, but that doesn’t solve the problem: the I/O thread still hangs forever inside epollWait. Since the thread hangs inside epollWait, the selector recreate logic is not executed. Do you have any idea what’s wrong?
(*) I saw that JeroMQ implements logic similar to netty’s to work around the JDK bug; however, in netty epollWait is called with a timeout of 10.
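For reference, the timeout change described above can be reproduced outside JeroMQ with a plain `java.nio` selector. The sketch below is only an illustration: `SelectTimeoutSketch`, `clampTimeout` and `pollOnce` are hypothetical names, not JeroMQ code. It relies on one documented fact, namely that `Selector.select(long)` treats a timeout of 0 as "block indefinitely", so clamping to a positive floor turns an open-ended wait into a bounded one. As the UPDATE at the top notes, this was not the actual cause of the hang.

```java
import java.io.IOException;
import java.nio.channels.Selector;

final class SelectTimeoutSketch {
    // Hypothetical helper (not part of JeroMQ): enforce a positive floor,
    // because Selector.select(0) means "block indefinitely".
    static long clampTimeout(long timerTimeout, long floorMillis) {
        return Math.max(floorMillis, timerTimeout);
    }

    // Waits up to timeoutMillis for ready channels and returns the number
    // of keys whose ready sets were updated.
    static int pollOnce(Selector selector, long timeoutMillis) throws IOException {
        return selector.select(timeoutMillis);
    }

    public static void main(String[] args) throws IOException {
        try (Selector selector = Selector.open()) {
            // Stand-in for a timer pass that reports "no timer pending" as 0.
            long timeout = clampTimeout(0, 10);
            int ready = pollOnce(selector, timeout);
            System.out.println("timeout=" + timeout + " ready=" + ready);
        }
    }
}
```

With no channels registered, the bounded `select` call simply returns once the timeout elapses, which lets the surrounding loop continue to whatever recovery logic follows it.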
In my application, I execute a test that uses four PULL sockets and various PUSH sockets. The test is started multiple times (because I do a bunch of simulations) in the same Java process but each time with a new context. After each test, sockets are closed and the context is closed gracefully.
Issue Analytics
- Created 6 years ago
- Comments:16 (10 by maintainers)
Sorry, no time for me to be on this topic this week.
I will be fast now and will provide more context later, but as long as your PR satisfies the terms of C4, it will be merged.
The following is not the project’s thinking but my very own: when the lib was bumped to 4.1.7, I decided to stick as much as possible to the logic of libzmq so it would be easier to maintain. Not everything can be translated back to Java (ByteBuffer or equivalents are not used in the C++ version), but the more the code sticks to it, the better I feel. To answer your question, I personally would go for both in the given order:
In your PR, this comparison in EncoderBase deviates from the C++ version, which I find hard to explain, while at the same time code that is present only in Java to decide whether to flip the buffer or not seems to be the source of the bug you reported. If I were you, I would invest a bit of time to try to refine that Java-specific code. But I am not you 😃
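For readers less familiar with the buffer-flipping concern mentioned above, here is a minimal, generic sketch of why a conditional flip on a `java.nio.ByteBuffer` is delicate. It is not the EncoderBase code and makes no claim about the actual bug; `FlipSketch` and `readableAfterFlips` are hypothetical names. It only shows the documented behaviour of `flip()`: the limit is set to the current position and the position is reset to 0, so skipping the flip or applying it twice changes how many bytes appear readable.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class FlipSketch {
    // Writes the payload into a fresh buffer, flips it the given number of
    // times, and reports how many bytes a reader would then see.
    static int readableAfterFlips(byte[] payload, int flips) {
        ByteBuffer buffer = ByteBuffer.allocate(64);
        buffer.put(payload);     // write mode: position advances past the payload
        for (int i = 0; i < flips; i++) {
            buffer.flip();       // limit = position, position = 0
        }
        return buffer.remaining();
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readableAfterFlips(payload, 0)); // not flipped: reports free space, not data
        System.out.println(readableAfterFlips(payload, 1)); // flipped once: the 5 payload bytes
        System.out.println(readableAfterFlips(payload, 2)); // flipped twice: nothing left to read
    }
}
```

Any logic that decides at runtime whether to flip has to track which of these states the buffer is in, which is one reason such Java-only code can be harder to reason about than the C++ original.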
If you can wait a week, I may be more available then.
SORRY @smattheis !