Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.


RPC deadlocks after a node restart

See original GitHub issue

With @Quiark, we observed this issue on Corda 3.1 and earlier versions: if the node an RPC client is connected to shuts down and starts up again while the client is performing an operation (e.g. starting a flow), RPCOps gets stuck. An exception like the following is thrown on one of the threads:

E 12:23:05+0800 [Thread-1 (ActiveMQ-client-global-threads)] DefaultPromise.rejectedExecution.error - Failed to submit a listener notification task. Event loop shut down? {}
 java.util.concurrent.RejectedExecutionException: event executor terminated
    at io.netty.util.concurrent.SingleThreadEventExecutor.reject( ~[netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask( ~[netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.addTask( ~[netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.execute( ~[netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.util.concurrent.DefaultPromise.safeExecute( [netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.util.concurrent.DefaultPromise.notifyListeners( [netty-all-4.1.9.Final.jar:4.1.9.Final]
    at io.netty.util.concurrent.DefaultPromise.setFailure( [netty-all-4.1.9.Final.jar:4.1.9.Final]
    ... 6 frames truncated in the original capture ... [netty-all-4.1.9.Final.jar:4.1.9.Final]
    at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.closeSSLAndChannel( [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.close( [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.internalClose( [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.destroy( [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect( [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure( [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure( [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.connectionException( [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector$Listener$ [artemis-core-client-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask( [artemis-commons-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask( [artemis-commons-2.2.0.jar:2.2.0]
    at org.apache.activemq.artemis.utils.actors.ProcessorBase$ [artemis-commons-2.2.0.jar:2.2.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker( [?:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$ [?:1.8.0_144]
    at [?:1.8.0_144]
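Until a library-side fix, one client-side mitigation is to avoid blocking forever on a proxy whose underlying Artemis connection has died. The sketch below is not Corda's API; it is a plain-JDK illustration, with hypothetical names (`TimeoutRpcGuard`, `callWithTimeout`), of wrapping a blocking RPC call in a deadline so a restart-induced hang surfaces as an exception instead of a deadlock:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of Corda): run a blocking RPC call with a
// deadline so that a client whose broker connection died during a node
// restart fails fast instead of hanging.
public class TimeoutRpcGuard {
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "rpc-guard");
        t.setDaemon(true); // don't keep the JVM alive for stuck calls
        return t;
    });

    public static <T> T callWithTimeout(Callable<T> rpcCall, long timeoutMillis) throws Exception {
        Future<T> future = POOL.submit(rpcCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck thread
            throw new TimeoutException(
                "RPC did not complete within " + timeoutMillis + " ms; the node may be restarting");
        }
    }
}
```

A caller would wrap a blocking call such as starting a flow in `callWithTimeout` and treat the timeout as a signal to tear down and rebuild the RPC connection.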

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

bpaunescu commented, Apr 13, 2018

@tomtau thank you for signalling this issue. It’s caused by Artemis not handling non-durable queues and messages gracefully during failover: specifically, messages don’t get re-sent if failover takes too long. Furthermore, some RPCs can’t handle connection loss at all (the ones that return observables and push updates).

There was a pull request to address these issues; I believe it was merged after 3.x was released. The current behaviour is to throw an RPCException for any RPC called during failover, and to fail and clean up ongoing RPCs as well. This ensures the client does not hang when the server is unreachable. The exception above is thrown by the Artemis thread that handles reconnection. We’ll try to find a way to deal with it, but it doesn’t indicate that anything is actually wrong.
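With failover failing fast rather than hanging, the client can treat the thrown exception as a cue to reconnect and re-issue the call. Below is a minimal retry sketch in plain Java; the names are illustrative and the catch-all is a simplification — a real client would catch the specific RPC failover exception only, and would also re-subscribe any observables, since their server-side state lives on non-durable queues and is lost on reconnect:

```java
import java.util.concurrent.Callable;

// Hypothetical client-side pattern (not Corda's API): retry an RPC with
// backoff after a failover-induced failure, rebuilding state on each attempt.
public class RpcRetry {
    public static <T> T withReconnect(Callable<T> call, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // In real code: open a fresh connection, re-subscribe
                // observables, then issue the RPC.
                return call.call();
            } catch (Exception e) { // in real code: catch the failover exception only
                last = e;
                Thread.sleep(backoffMillis * attempt); // linear backoff before reconnecting
            }
        }
        throw last;
    }
}
```

The linear backoff keeps a restarting node from being hammered with reconnect attempts while it is still coming up.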

bpaunescu commented, Sep 10, 2018

The issue is fixed and will be included in 3.3.
