AsynchronousTlsChannelGroup#processPendingInterests can throw CancelledKeyException

In tests of the MongoDB Java driver that use this library, I’ve seen occasional, non-deterministic failures where AsynchronousTlsChannelGroup#processPendingInterests throws CancelledKeyException, causing AsynchronousTlsChannelGroup.loop to exit. It happens in cases where we are forcing the server to close the socket in order to test failure scenarios.

I’m not exactly sure why this is happening, but I do see that AsynchronousTlsChannelGroup.loop already wraps calls to java.nio.channels.SelectionKey#interestOps(int) in a try/catch of CancelledKeyException. Does it make sense to do the same in AsynchronousTlsChannelGroup#processPendingInterests, e.g.:

  private void processPendingInterests() {
    for (SelectionKey key : selector.keys()) {
      RegisteredSocket socket = (RegisteredSocket) key.attachment();
      int pending = socket.pendingOps.getAndSet(0);
      if (pending != 0) {
        try {
          key.interestOps(key.interestOps() | pending);
        } catch (CancelledKeyException e) {
          // ignore: the key was cancelled (e.g. by a concurrent close), so there is nothing to update
        }
      }
    }
  }

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

stIncMale commented, Mar 3, 2021 (1 reaction)

@martinandersson Below is my explanation of the problem and the fix.

The method AsynchronousTlsChannelGroup.RegisteredSocket.close may be called by any thread, as a result of that thread calling AsynchronousTlsChannel.close. AsynchronousTlsChannelGroup.RegisteredSocket.close calls SelectionKey.cancel, which, per its documentation:

"Requests that the registration of this key’s channel with its selector be cancelled. Upon return the key will be invalid and will have been added to its selector’s cancelled-key set. The key will be removed from all of the selector’s key sets during the next selection operation."

If we now look at the documentation of Selector, we can see that:

"The cancelled-key set is the set of keys that have been cancelled but whose channels have not yet been deregistered. This set is not directly accessible. The cancelled-key set is always a subset of the key set."

This means that selector.keys in AsynchronousTlsChannelGroup.processPendingInterests may return cancelled SelectionKeys. Calling key.interestOps on such SelectionKeys results in CancelledKeyException as per the documentation of SelectionKey.interestOps.

Thus, depending on how AsynchronousTlsChannelGroup and AsynchronousTlsChannel are used in a program, the program may have a race condition.
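The behaviour described above can be reproduced deterministically on a single thread, without tlschannel at all. The following is only a minimal sketch: the class name CancelledKeyDemo and the use of a java.nio.channels.Pipe as the registered channel are my own choices for illustration, not anything taken from AsynchronousTlsChannelGroup.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Illustrative only: not part of tlschannel.
final class CancelledKeyDemo {

  /** Returns a short trace of what the key and the selector report at each step. */
  static String run() {
    StringBuilder trace = new StringBuilder();
    try {
      Pipe pipe = Pipe.open();
      try (Selector selector = Selector.open()) {
        pipe.source().configureBlocking(false);
        SelectionKey key = pipe.source().register(selector, SelectionKey.OP_READ);

        // Same call that RegisteredSocket.close ends up making.
        key.cancel();

        // The key is invalid at once, but stays in the key set until the next selection.
        trace.append("valid=").append(key.isValid());
        trace.append(" inKeys=").append(selector.keys().contains(key));

        try {
          // Same expression as in processPendingInterests.
          key.interestOps(key.interestOps() | SelectionKey.OP_READ);
          trace.append(" threw=false");
        } catch (CancelledKeyException e) {
          trace.append(" threw=true");
        }

        // A selection operation deregisters the cancelled key.
        selector.selectNow();
        trace.append(" inKeysAfterSelect=").append(selector.keys().contains(key));
      } finally {
        pipe.source().close();
        pipe.sink().close();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return trace.toString();
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

Going by the documentation quoted above, the trace should read valid=false inKeys=true threw=true inKeysAfterSelect=false. In the real group the cancel comes from another thread, which is why the failure only shows up occasionally.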

Two approaches are possible:

  1. change the usage of SelectionKey.cancel, Selector.select, Selector.keys, SelectionKey.isValid, SelectionKey.interestOps methods in AsynchronousTlsChannelGroup in such a way that there can be no such race condition anymore;
  2. catch and ignore CancelledKeyException when it happens as a result of a program having the race condition.

The second approach seems (maybe surprisingly) preferable in this case because it is both simpler and introduces less performance overhead, assuming that CancelledKeyException is thrown much more rarely than the method SelectionKey.cancel is called.
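To illustrate the second approach, here is a rough sketch of a helper that treats a cancelled key as "nothing to do". The names InterestOpsHelper and tryAddInterest are hypothetical and do not exist in tlschannel. Note that the isValid check is only a fast path: another thread may cancel the key right after it returns true, so the catch block is still what makes the helper safe.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Illustrative only: hypothetical helper, not part of tlschannel.
final class InterestOpsHelper {

  /**
   * Adds ops to the key's interest set.
   * Returns false if the key was cancelled, possibly by another thread.
   */
  static boolean tryAddInterest(SelectionKey key, int ops) {
    // Cheap fast path; not sufficient on its own because of the check-then-act race.
    if (!key.isValid()) {
      return false;
    }
    try {
      key.interestOps(key.interestOps() | ops);
      return true;
    } catch (CancelledKeyException e) {
      // The key was cancelled between the check and the update.
      return false;
    }
  }

  /** Exercises the helper on a live key and then on a cancelled one. */
  static String demo() {
    try {
      Pipe pipe = Pipe.open();
      try (Selector selector = Selector.open()) {
        pipe.source().configureBlocking(false);
        // Register with an empty interest set, then add OP_READ through the helper.
        SelectionKey key = pipe.source().register(selector, 0);

        boolean live = tryAddInterest(key, SelectionKey.OP_READ);
        boolean opsIsRead = key.interestOps() == SelectionKey.OP_READ;

        key.cancel();
        boolean cancelled = tryAddInterest(key, SelectionKey.OP_READ);

        return "live=" + live + " opsIsRead=" + opsIsRead + " cancelled=" + cancelled;
      } finally {
        pipe.source().close();
        pipe.sink().close();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The boolean return value is a design choice for the sketch; a caller that does not care whether the key was still registered can simply ignore it, which matches the "catch and ignore" wording above.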

marianobarrios commented, Mar 14, 2021 (0 reactions)

The Selector API is already racy here. But a lot of “closing workflows” are racy and benign, as typically not much happens after a close to matter anyway.

Something that would help: having a test that shows non-deterministic behavior due to this race.
