
RPC deadlocks after a node restart


With @Quiark, we observed this issue on Corda 3.1 and earlier versions: if an RPC operation is in progress (e.g. starting a flow) and the node that the RPC client is connected to shuts down and then starts up again, RPCOps gets stuck. An exception is thrown on one of the threads, something like:

E 12:23:05+0800 [Thread-1 (ActiveMQ-client-global-threads)] DefaultPromise.rejectedExecution.error - Failed to submit a listener notification task. Event loop shut down? {}
 java.util.concurrent.RejectedExecutionException: event executor terminated
    at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:821) ~[netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:327) ~[netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:320) ~[netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:746) ~[netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:760) [netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:428) [netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.util.concurrent.DefaultPromise.setFailure(DefaultPromise.java:113) [netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.channel.DefaultChannelPromise.setFailure(DefaultChannelPromise.java:87) [netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.channel.AbstractChannelHandlerContext.safeExecute(AbstractChannelHandlerContext.java:1010) [netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:610) [netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:465) [netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.channel.DefaultChannelPipeline.close(DefaultChannelPipeline.java:964) [netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.channel.AbstractChannel.close(AbstractChannel.java:234) [netty-all-4.1.9.Final.jar:4.1.9.Final]
    at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.closeSSLAndChannel(NettyConnection.java:549) [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.close(NettyConnection.java:245) [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.internalClose(RemotingConnectionImpl.java:396) [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.destroy(RemotingConnectionImpl.java:229) [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:617) [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:504) [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:497) [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.connectionException(ClientSessionFactoryImpl.java:368) [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector$Listener$2.run(NettyConnector.java:1042) [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) [artemis-commons-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) [artemis-commons-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.utils.actors.ProcessorBase$ExecutorTask.run(ProcessorBase.java:53) [artemis-commons-2.2.0.jar:2.2.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
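
A client-side setup along the following lines can run into the scenario described above. This is only a sketch: the RPC address, credentials and the SomeFlow class are placeholders for illustration, not details from the original report.

import net.corda.client.rpc.CordaRPCClient
import net.corda.core.utilities.NetworkHostAndPort

fun main() {
    // Placeholder RPC address and credentials; adjust for the node's configuration.
    val client = CordaRPCClient(NetworkHostAndPort("localhost", 10006))
    val connection = client.start("user1", "test")
    val rpcOps = connection.proxy

    // If the node shuts down and restarts around this point, the call below
    // could hang indefinitely on Corda 3.1, which is the behaviour described above.
    // SomeFlow is a hypothetical flow class standing in for whatever flow is started.
    val handle = rpcOps.startFlowDynamic(SomeFlow::class.java)
    println(handle.returnValue.get())

    connection.notifyServerAndClose()
}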

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
bpaunescu commented, Apr 13, 2018

@tomtau thank you for signalling this issue. It’s caused by Artemis not being too nice about non-durable queues and messages during failover. Specifically, messages don’t get re-sent if failover takes too long. Furthermore, there are RPCs that can’t really handle connection loss (the ones that return observables and send updates).

There was a pull request https://github.com/corda/corda/pull/2770 to address these issues. I believe it was done after 3.x was released. The behaviour at the moment is to throw an RPCException for any RPCs called during failover, and to also fail and clean up ongoing RPCs. This ensures nothing hangs when the server is unreachable. The above exception is thrown by the Artemis thread that handles reconnection. We’ll try to find a way to deal with it, but it doesn’t indicate anything wrong, really.
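
For illustration, with that behaviour an RPC issued while the node is unreachable should fail fast with an RPCException rather than hang, so the caller can catch it, reconnect and retry. Below is a rough sketch of that pattern; the address, credentials, retry count and delay are made-up values, and the RPCException import from net.corda.client.rpc is an assumption.

import net.corda.client.rpc.CordaRPCClient
import net.corda.client.rpc.RPCException
import net.corda.core.utilities.NetworkHostAndPort

// Retries a block of RPC work, reconnecting after an RPCException.
// The attempt count and back-off delay are arbitrary illustration values.
fun <T> withRpcRetry(attempts: Int, block: () -> T): T {
    var lastError: RPCException? = null
    repeat(attempts) {
        try {
            return block()
        } catch (e: RPCException) {
            lastError = e
            Thread.sleep(5_000) // give the node time to come back before retrying
        }
    }
    throw lastError ?: IllegalStateException("no attempts were made")
}

fun main() {
    val client = CordaRPCClient(NetworkHostAndPort("localhost", 10006))
    val snapshot = withRpcRetry(attempts = 3) {
        // Each attempt opens a fresh connection so a dead session is not reused.
        client.start("user1", "test").use { connection ->
            connection.proxy.networkMapSnapshot()
        }
    }
    println(snapshot)
}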

0 reactions
bpaunescu commented, Sep 10, 2018

The issue has been fixed and the fix will be included in 3.3.

