question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Cannot re-connect to server after nats-streaming-server is restarted

See original GitHub issue

I’ve encountered the following issue while trying to reconnect to nats-streaming-server. Here are the steps to reproduce:

  1. nats-streaming-server is running
  2. the app (see demo code below) connects to nats-streaming-server and starts publishing messages
  3. the nats-streaming-server is killed
  4. on each subsequent Publish, STAN.Client.StanConnectionException: The NATS connection is reconnecting is printed
  5. after a while (server still not up), on each Publish, the app starts to receive STAN.Client.StanConnectionClosedException: Connection closed.
  6. I restart nats-streaming-server
  7. The app’s ReconnectedEventHandler is fired, detecting nats connection restore.
  8. But the app’s StanConnection never recovers and the application can never resume successful publishing.

What can I do in order to allow the app to recover publishing?

Running on OSX Catalina and using latest nuget packages available at the current time:

    <ItemGroup>
        <PackageReference Include="NATS.Client" Version="0.10.0" />
        <PackageReference Include="STAN.Client" Version="0.2.0" />
    </ItemGroup>

Code used for testing:

using System;
using System.Diagnostics;
using System.Threading;
using NATS.Client;
using STAN.Client;

namespace StanTest
{
    class Program
    {
        static void Main(string[] args)
        {
            var cf = new ConnectionFactory();
            var sf = new StanConnectionFactory();
            var natsConnection = cf.CreateConnection(GetOpts());

            var stanOpts = StanOptions.GetDefaultOptions();
            stanOpts.ConnectTimeout = 4000;
            stanOpts.NatsConn = natsConnection;
            stanOpts.PubAckWait = 40000;

            var stanConnection = sf.CreateConnection("test-cluster", "uniq123", stanOpts);
            var watch = Stopwatch.StartNew();

            while (true)
            {
                try
                {
                    stanConnection.Publish("test", new byte[0x1]);
                    Console.WriteLine("{0}. Published message", watch.Elapsed);
                }
                catch (Exception e)
                {
                    Console.WriteLine("{0} - Type:{1}. On Publish: exception message: {2}", watch.Elapsed, e.GetType(), e.Message);
                }
                finally
                {
                    Thread.Sleep(1000);
                }
            }
        }
        
        private static Options GetOpts()
        {
            var opts = ConnectionFactory.GetDefaultOptions();
            opts.Url = "nats://localhost:4222";
            opts.AllowReconnect = true;
            opts.PingInterval = 5000;
            opts.MaxPingsOut = 2;
            opts.MaxReconnect = Options.ReconnectForever;
            opts.ReconnectWait = 1000;
            opts.Timeout = 4000;

            opts.ServerDiscoveredEventHandler += (sender, args) => Console.WriteLine("NATS server discovered");

            opts.ReconnectedEventHandler +=
                (sender, args) =>  Console.WriteLine( "NATS server reconnected.");
            opts.ClosedEventHandler +=
                (sender, args) => Console.WriteLine("NATS connection closed");
            opts.DisconnectedEventHandler += (sender, args) =>
                Console.WriteLine("NATS connection disconnected");
            opts.AsyncErrorEventHandler +=
                (sender, args) => Console.WriteLine("NATS async error: {0}, Message={1}, Subject={2}", args.Conn.ConnectedUrl,
                    args.Error, args.Subscription.Subject);

            return opts;
        }
    }   
}

And here are the messages printed to console while running:

/usr/local/share/dotnet/dotnet /Users/robert/Sandbox/StanTest/bin/Debug/netcoreapp3.0/StanTest.dll
00:00:00.0237181. Published message
00:00:01.0443621. Published message
NATS connection disconnected
00:00:02.0483280 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:03.0528216 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:04.0580304 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:05.0612458 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:06.0655462 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:07.0696808 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:08.0721798 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:09.0763581 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:10.0803322 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:11.0852381 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:12.0898777 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:13.0904445 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:14.0946184 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:15.0953681 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:16.0963306 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:17.1017706 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:18.1047256 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:19.1077148 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:20.1114949 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:21.1162681 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:22.1206395 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:23.1238276 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
NATS server reconnected.
00:00:24.1287867 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:25.1324792 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:26.1373635 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:27.1406061 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:28.1452916 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:29.1501845 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
^C

Issue Analytics

  • State:closed
  • Created 4 years ago
  • Comments:30 (13 by maintainers)

github_iconTop GitHub Comments

1reaction
shokri-navidcommented, Aug 22, 2021

I have encountered the same situation in version 0.3.0. when my Stan connection entered to Connection Closed Mode it never comes to connect state and all publish requests throw the Connection Close Exception.

this is my code to register stan Connection:

public static class Extensions
    {
        private static IStanConnection _stanConnection; 
        public static IServiceCollection AddStanClient(this IServiceCollection services, StanSettings settings)
        {

            var stanOptions = StanOptions.GetDefaultOptions();
            var natsOptions = ConnectionFactory.GetDefaultOptions();
            var stanClientId = $"{settings.ClientId}-{Guid.NewGuid()}";
            var stanClusterId = settings.ClusterId;
            natsOptions.Servers = settings.NatsServers.ToArray();
            natsOptions.AllowReconnect = true;
            natsOptions.MaxReconnect = Options.ReconnectForever;
            natsOptions.PingInterval = 2000;
            natsOptions.MaxPingsOut = 2;
            natsOptions.ReconnectWait = 1000;
            natsOptions.ReconnectBufferSize = Options.ReconnectBufferSizeUnbounded;
            natsOptions.Timeout = 4000;
             
            natsOptions.ClosedEventHandler += (object sender, ConnEventArgs arg) =>
            {
                Console.WriteLine($"connection lost on connection {arg.Conn.ConnectedId}" +
                                  $"{Environment.NewLine} {arg.Error.StackTrace}");
            };

            natsOptions.ReconnectedEventHandler += (object sender, ConnEventArgs arg) =>
            {
                stanOptions.NatsConn = arg.Conn;
                Console.WriteLine($"connection Recovered by connection id {arg.Conn.ConnectedId}");
            }; 
            
            
            stanOptions.NatsConn = new ConnectionFactory().CreateConnection(natsOptions);
            stanOptions.ConnectionLostEventHandler += (object sender, StanConnLostHandlerArgs arg) => {
                Console.WriteLine($"connection lost on connection {arg.Connection.NATSConnection.ConnectedId}" +
                                  $"{Environment.NewLine} {arg.ConnectionException.StackTrace}"); 
            };

            _stanConnection = new StanConnectionFactory().CreateConnection(stanClusterId, stanClientId, stanOptions); 
            services.AddSingleton<IStanConnection>( _stanConnection);

            return services;
        }
    }

and I use this connection like this :

static Task Main(string[] args)
        {
            var host = CreateHostBuilder(args).Build();

            var scope = host.Services.CreateScope();
            

            var watch = new Stopwatch();
            watch.Start();
            while (true)
            {
                var stanConnection = scope.ServiceProvider.GetRequiredService<IStanConnection>();
                try
                {
                    stanConnection.Publish("test", new byte[0x1]);
                    Console.WriteLine("{0}. Published message", watch.Elapsed);
                }
                catch (Exception e)
                {
                    Console.WriteLine("{0} - Type:{1}. On Publish: exception message: {2}", watch.Elapsed, e.GetType(), e.Message);
                }
                finally
                {
                    Thread.Sleep(1000);
                }
            }
            Console.WriteLine("Hello World!");

            return host.StartAsync();
        }


        static IHostBuilder CreateHostBuilder(string[] args) =>
            Host.CreateDefaultBuilder(args)
                .ConfigureServices((_, services) =>
                    services.AddStanClient(new StanSettings
                    {
                        ClientId = "uniq123",
                        ClusterId = "test-cluster",
                        NatsServers = new List<string>{"nats://localhost:4222"}
                    })); 

    }
    ```
    and this is the output of my code when I stop the Stan service and after a while, I start it again :
    
00:00:13.7476360. Published message
00:00:14.7542784. Published message
00:00:15.7602945. Published message
00:00:16.7658160. Published message
00:00:17.7862428. Published message
00:00:18.8041595. Published message
00:00:20.6005542 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:21.6487055 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:22.7012163 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:23.7444252 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:24.7905985 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:25.8348325 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:26.8816522 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:27.9295129 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:28.9737369 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:30.0154864 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:31.0607351 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:32.1057689 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:33.1495434 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:34.1919119 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:35.2404412 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:36.3016650 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:37.3464719 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:38.3887135 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:39.4637382 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:40.5152720 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:41.5576260 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
connection Recovered by connection id NDIVMUUHKJN644WRMAFRWW7GB5TD3ZPYVE2CIJWFMCLJDXCWIBDURO6L
00:00:42.6050295 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:43.6454686 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:44.6864267 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:45.7310466 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
1reaction
robertmirceacommented, Dec 28, 2019

Many thanks for your time spent on this issue! Is kind of odd and I can never imagine a practical situation in which I would not like that a STAN client to recover when ping failed but the enclosed NATS connection is alive. I was led by the impression that STAN connection should be long lived (and act as a singleton) similar to NATS connection object. In my prod app, I was registering the STAN connection as a singleton with the DI container. So, to conclude: you are telling me that:

Either:

  1. disable (set very high timeout/no of pings) ping on STAN connection, but leave it on for NATS connection
  2. or, recreate STAN connection every time when the underlying connection to NATS is closed.

Could you reconsider the architectural decision of not recovering STAN connection in case that NATS streaming server is running in embedded mode and STAN client using a not self-managed NATS connection and make somehow transparent for the STAN client how the NATS server is ran?

I will try to reproduce this test using the java streaming server client and see if the behaviour is similar.

Read more comments on GitHub >

github_iconTop Results From Across the Web

c# - How to renew a singleton closed Stan (nats streaming ...
When it loses the connection to the nats streaming server for 15 seconds it is in reconnecting mode and if the connection works...
Read more >
Resolving a NATS streaming cluster failed restart
If the NATS streaming cluster fails to restart automatically, delete the PVC and the failing pod so that a new pod can be...
Read more >
NATS streaming reconnecting to a subscription
I'm not managing to get clear on ordering guarantees provided by NATS in the context of resubscribing after a disconnection. The problem I'm...
Read more >
Client Connections - NATS Docs
This means that if the streaming server is stopped, all state is lost. On server restart, since no connection information is recovered, running...
Read more >
Investigating NATS issues - DataMiner Docs
Try a NATS reset; Check if new NATS connections can be established; Check the NAS and NATS logging; Check if port is already...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found