RPC deadlocks after a node restart
See original GitHub issueWith @Quiark, we observed this issue with Corda 3.1 and prior versions: when one is doing some operation with RPC (e.g. starting a flow) and at that time, a node (that a particular RPC Client is connected to) shuts down and then starts up again, RPCOps is stuck. There’s an exception thrown on one of the threads, something like:
E 12:23:05+0800 [Thread-1 (ActiveMQ-client-global-threads)] DefaultPromise.rejectedExecution.error - Failed to submit a listener notification task. Event loop shut down? {}
java.util.concurrent.RejectedExecutionException: event executor terminated
at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:821) ~[netty-all-4.1.9.Final.jar:4.1.9.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:327) ~[netty-all-4.1.9.Final.jar:4.1.9.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:320) ~[netty-all-4.1.9.Final.jar:4.1.9.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:746) ~[netty-all-4.1.9.Final.jar:4.1.9.Final]
at io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:760) [netty-all-4.1.9.Final.jar:4.1.9.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:428) [netty-all-4.1.9.Final.jar:4.1.9.Final]
at io.netty.util.concurrent.DefaultPromise.setFailure(DefaultPromise.java:113) [netty-all-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.DefaultChannelPromise.setFailure(DefaultChannelPromise.java:87) [netty-all-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.AbstractChannelHandlerContext.safeExecute(AbstractChannelHandlerContext.java:1010) [netty-all-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:610) [netty-all-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:465) [netty-all-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.DefaultChannelPipeline.close(DefaultChannelPipeline.java:964) [netty-all-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.AbstractChannel.close(AbstractChannel.java:234) [netty-all-4.1.9.Final.jar:4.1.9.Final]
at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.closeSSLAndChannel(NettyConnection.java:549) [artemis-core-client-2.2.0.jar:2.2.0]
at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.close(NettyConnection.java:245) [artemis-core-client-2.2.0.jar:2.2.0]
at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.internalClose(RemotingConnectionImpl.java:396) [artemis-core-client-2.2.0.jar:2.2.0]
at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.destroy(RemotingConnectionImpl.java:229) [artemis-core-client-2.2.0.jar:2.2.0]
at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:617) [artemis-core-client-2.2.0.jar:2.2.0]
at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:504) [artemis-core-client-2.2.0.jar:2.2.0]
at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:497) [artemis-core-client-2.2.0.jar:2.2.0]
at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.connectionException(ClientSessionFactoryImpl.java:368) [artemis-core-client-2.2.0.jar:2.2.0]
at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector$Listener$2.run(NettyConnector.java:1042) [artemis-core-client-2.2.0.jar:2.2.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) [artemis-commons-2.2.0.jar:2.2.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) [artemis-commons-2.2.0.jar:2.2.0]
at org.apache.activemq.artemis.utils.actors.ProcessorBase$ExecutorTask.run(ProcessorBase.java:53) [artemis-commons-2.2.0.jar:2.2.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Issue Analytics
- State:
- Created 5 years ago
- Comments:8 (6 by maintainers)
Top Results From Across the Web
RPC deadlocks after a node restart - - Bountysource
With @Quiark, we observed this issue with Corda 3.1 and prior versions: when one is doing some operation with RPC (e.g. starting a...
Read more >IJ12492: DEADLOCK GROUPPROTOCOLDRIVERTHREAD ...
After logAssertFailed, restart gpfs, node showed as arbitrating, could not join back cluster. Cluster manager node have deadlock messages.
Read more >How to avoid MySQL 'Deadlock found when trying to get lock
One easy trick that can help with most deadlocks is sorting the operations in a specific order. You get a deadlock when two...
Read more >system node reboot - Product documentation - NetApp
The system node reboot command restarts a node. ... failover partner is not allowed to take over for the node when the node...
Read more >NFSv4 server restarts causes long pause in NFS client when ...
During the grace period, the server must reject READ and WRITE operations and non-reclaim locking requests (i.e., other LOCK and OPEN operations) ...
Read more >
Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free
Top Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
@tomtau thank you for signalling this issue. It’s caused by Artemis not being too nice about non-durable queues and messages during failover. Specifically, messages don’t get re-sent if failover takes too long. Furthermore, there are RPCs that can’t really handle connection loss(the ones that return observables and send updates).
There was a pull request https://github.com/corda/corda/pull/2770 to address these issues. I believe it was done after 3.x was released. The behaviour at the moment is to throw an RPCException for any RPCs called during failover, and to also throw and clean-up ongoing RPCs. This will ensure no hanging in case server is unreachable. The above exception is thrown by the Artemis thread that handles reconnection. We’ll try to find a way to deal with it, but it doesn’t indicate anything wrong, really.
Issue fixed and will be included in 3.3