Hopeless SSL failures not reported to client implementation
Hi folks, I'm writing an application that relies on a long-lived bidirectional stream of messages, with the client-server channel running over mTLS, and I'm hitting a bit of a problem.
On startup the client begins a bidirectional streaming RPC to the server, constructing a stream and sending a "register" message over it. The server does not send a "register acknowledge" message, and I'm unfortunately not in a position to modify the protocol here, only my implementation. This bidirectional streaming RPC is kept running for the lifetime of the application (until either the client or server crashes, which ideally happens rarely).
The certificates used on the client side expire and are re-issued frequently (as fast as every 5 minutes, if a customer running this application decides that's required), and we've written our client to re-build the underlying channel the RPC is being made over should the RPC end (e.g. if the server crashes and the client's `StreamObserver::onError` is called).
We're finding that, if the following occurs, our client hangs forever attempting to connect to the server with a set of certificates the server will never accept:
- Server crashes, client is disconnected and re-builds channel, taking in newly-issued set of certificates (cert-set A)
- Server remains offline while another new set of certificates is issued (cert-set B, cert-set A now expired)
- Server comes online, begins processing connection requests
- Client is now in a state where it believes the initial `register` message has been processed, but actually the underlying connection is faulty and the channel is stuck in a retry loop forever.
Obviously using something like a deadline isn’t an option here, due to the protocol design.
I suspect the correct approach is to rely on something like a `DelegatingSslContext` and call `DelegatingSslContext::update` on detected certificate re-issue; however, it would be great if you happen to know a way for this particular failure to be detectable purely through the `StreamObserver` interface on the client side?
It seems like ideally it should be possible to detect this kind of "hopeless" situation (at a base level, the client certificate is expired, so its `notAfter` will evaluate to a time in the past, which should be enough to say the handshake can never succeed), but I understand it's tricky - perhaps an `SSLException` or `IOException` passed to `onError` would be appropriate, but I'm not sure.
I’ve included a reproducing case below, with the caveat that rather than creating a client certificate and having it expire, the reproducing case simply has the server require client certificates and then has the client not send any - a similarly “hopeless” case, but without any tricky timing shenanigans. To reproduce:
- Run “genSecurityContext.sh” to generate a certificate authority and a server cert/key pair signed by that authority
- Modify the constant “SEC_MATERIAL” to point to wherever you ran “genSecurityContext.sh”
- Run the application via Main::main(), and note the client is never notified of the permanently broken netty channel
The case is packaged as a Maven project for convenience, but if you're (understandably) leery about unzipping random files, the bulk of the logic is:
```java
package com.ericsson.test;

import io.grpc.ManagedChannel;
import io.grpc.Server;
import io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts;
import io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder;
import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;
import io.grpc.netty.shaded.io.netty.handler.ssl.ClientAuth;
import io.grpc.netty.shaded.io.netty.handler.ssl.SslContext;
import io.grpc.netty.shaded.io.netty.handler.ssl.SslContextBuilder;
import io.grpc.stub.StreamObserver;

import javax.net.ssl.SSLException;
import java.io.File;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class Main {
    private static final String SEC_MATERIAL =
            "C:\\Users\\ebrooli\\Documents\\projects\\ongoing\\GSSUPP-7063\\reproduce_grpc_failure\\src\\main\\resources"; // CHANGEME
    private static final String SERVER_CERT = SEC_MATERIAL + "\\server.pem";
    private static final String SERVER_KEY = SEC_MATERIAL + "\\server.key";
    private static final String CA = SEC_MATERIAL + "\\ca.pem";

    static {
        System.setProperty("java.util.logging.config.file",
                "C:\\Users\\ebrooli\\Documents\\projects\\ongoing\\GSSUPP-7063\\reproduce_grpc_failure\\src\\main\\resources\\logging.properties"); // CHANGEME
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        final var channel = getClientChannel();
        final var clientReceiveStream = new PrintObserver();
        final var clientSendStream = TestGrpc.newStub(channel).withWaitForReady().exchangeStream(clientReceiveStream);
        System.out.println("Sending first message");
        clientSendStream.onNext(Message.newBuilder().setPayload("test").build());
        System.out.println("First message sent");
        final var server = buildServer();
        System.out.println("Server built");
        clientSendStream.onNext(Message.newBuilder().setPayload("test2").build());
        System.out.println("Second message sent");
        server.awaitTermination();
    }

    private static Server buildServer() throws IOException {
        final var service = new TestImpl();
        return NettyServerBuilder.forPort(3000)
                .addService(service)
                .keepAliveTime(4, TimeUnit.SECONDS)
                .keepAliveTimeout(1, TimeUnit.SECONDS)
                .permitKeepAliveTime(10, TimeUnit.SECONDS)
                .permitKeepAliveWithoutCalls(true)
                .sslContext(getSslContext()).build().start();
    }

    private static ManagedChannel getClientChannel() throws SSLException {
        return NettyChannelBuilder
                .forAddress("localhost", 3000)
                .keepAliveTime(2, TimeUnit.MINUTES)
                .keepAliveTimeout(10, TimeUnit.SECONDS)
                .keepAliveWithoutCalls(true)
                .disableRetry()
                .sslContext(getClientContext()).build();
    }

    private static SslContext getSslContext() throws SSLException {
        final SslContextBuilder builder = GrpcSslContexts.forServer(new File(SERVER_CERT), new File(SERVER_KEY));
        // Require the client to use mTLS, then when we don't use mTLS on the client side it looks like a handshake failure
        builder.clientAuth(ClientAuth.REQUIRE);
        return builder.build();
    }

    private static SslContext getClientContext() throws SSLException {
        // We're going to setup the client for failure here by not providing a client cert
        return GrpcSslContexts.forClient().trustManager(new File(CA)).build();
    }

    private static class TestImpl extends TestGrpc.TestImplBase {
        @Override
        public StreamObserver<Message> exchangeStream(StreamObserver<Message> clientStream) {
            return new PrintObserver();
        }
    }

    private static class PrintObserver implements StreamObserver<Message> {
        @Override
        public void onNext(Message message) {
            System.out.println(message);
        }

        @Override
        public void onError(Throwable throwable) {
            System.out.println("onError called");
            throwable.printStackTrace();
        }

        @Override
        public void onCompleted() {
            System.out.println("onCompleted");
        }
    }
}
```
syntax = "proto3";
option java_multiple_files = true;
option java_package = "com.ericsson.test";
option java_outer_classname = "TestService";
package com.ericsson.test;
service Test {
rpc exchangeStream(stream Message) returns (stream Message) {}
}
message Message {
string payload = 1;
}
Thanks, Oliver
Top GitHub Comments
Feel free to use `AdvancedTlsX509KeyManager` and `AdvancedTlsX509TrustManager`. Those can do the polling and swap-out for you. FYI, you can also use `TlsChannelCredentials` and `TlsServerCredentials` these days. They are stable APIs (unlike the Netty-based APIs). gRFC L74 has details about using the channel credentials API.
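For reference, a rough sketch of what that might look like on the client side - this assumes the `io.grpc.util` AdvancedTls managers' file-watching overloads and the `TlsChannelCredentials` builder; exact method names and argument order may differ between gRPC versions, so treat it as an outline rather than the definitive API:

```java
import io.grpc.ChannelCredentials;
import io.grpc.Grpc;
import io.grpc.ManagedChannel;
import io.grpc.TlsChannelCredentials;
import io.grpc.util.AdvancedTlsX509KeyManager;
import io.grpc.util.AdvancedTlsX509TrustManager;

import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ReloadingChannel {
    public static ManagedChannel build() throws Exception {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

        // Re-reads the key/cert pair from disk once a minute, so a re-issued
        // certificate is picked up without rebuilding the channel by hand.
        // (Private key first, then cert chain - check the Javadoc for your gRPC version.)
        AdvancedTlsX509KeyManager keyManager = new AdvancedTlsX509KeyManager();
        keyManager.updateIdentityCredentialsFromFile(
                new File("client.key"), new File("client.pem"),
                1, TimeUnit.MINUTES, executor);

        AdvancedTlsX509TrustManager trustManager = AdvancedTlsX509TrustManager.newBuilder().build();
        trustManager.updateTrustCredentialsFromFile(
                new File("ca.pem"), 1, TimeUnit.MINUTES, executor);

        ChannelCredentials creds = TlsChannelCredentials.newBuilder()
                .keyManager(keyManager)
                .trustManager(trustManager)
                .build();

        return Grpc.newChannelBuilderForAddress("localhost", 3000, creds).build();
    }
}
```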
Fair point on the external server's clock being in the past; you're right that almost all "permanent" failures are extremely difficult to identify as such, given the high dependence on some other system.

For now I think internally we'll be sticking with polling the security material on disk to watch for re-issue and rebuilding the underlying SSL context when appropriate; manually implementing a sensible retry back-off mechanism is probably not worth it compared to the CPU cycles spent doing the modified-timestamp check on 3 files at some reasonably low frequency (for us, for now).
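(For completeness, the polling we have in mind is nothing more sophisticated than something like the following - names and intervals are illustrative only:)

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CertReissueWatcher {
    // Hypothetical sketch: watch the three security files for modification and
    // invoke a callback (which rebuilds the SslContext/channel) when any of them changes.
    public static void watch(File cert, File key, File ca, Runnable onReissue) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        long[] lastSeen = {cert.lastModified(), key.lastModified(), ca.lastModified()};
        executor.scheduleAtFixedRate(() -> {
            long[] now = {cert.lastModified(), key.lastModified(), ca.lastModified()};
            for (int i = 0; i < now.length; i++) {
                if (now[i] != lastSeen[i]) {
                    System.arraycopy(now, 0, lastSeen, 0, now.length);
                    onReissue.run();
                    return;
                }
            }
        }, 30, 30, TimeUnit.SECONDS);
    }
}
```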
Thanks for the explanation, I’m going to close this issue here, take it easy.