NATS client doesn't reconnect on NATS reboot (LB setup)

Hello,

I’ve encountered a problem: when restarting a NATS cluster, the NATS Python client doesn’t reconnect by itself. This is really annoying, because all our services then have to be restarted.

Maybe it’s because our NATS server is using TLS; I get the following output on the subscriber:

Error: <class 'nats.aio.errors.ErrStaleConnection'>

Our config looks like this:

 # PID file shared with configuration reloader.
 pid_file: "/var/run/nats/nats.pid"
 listen: 0.0.0.0:4222 
                      
 ###############      
 #             #      
 # Monitoring  #      
 #             #      
 ###############      
 http: 8222           
 server_name: $POD_NAME
 #####################
 #                   #
 # TLS Configuration #
 #                   #
 #####################
                      
 tls {                
    cert_file: /etc/nats-certs/clients/nats-client-tls/tls.crt
    key_file: /etc/nats-certs/clients/nats-client-tls/tls.key
 }                    
                      
                      
 ###################################
 #                                 #
 # NATS Full Mesh Clustering Setup #
 #                                 #
 ###################################
 cluster {
   port: 6222

   routes = [         
     nats://nats-0.nats.testing.svc:6222,nats://nats-1.nats.testing.svc:6222,nats://nats-2.nats.testing.svc:6222,
   ]
   cluster_advertise: $CLUSTER_ADVERTISE
   no_advertise: true

   connect_retries: 30
 }

  ##################
  #                #
  # Authorization  #
  #                #
  ##################
  accounts: {
      ...
  }

  system_account: SYS

How to reproduce:

Cluster setup

The server is running on a k8s cluster using the existing Helm chart. The auth config is slightly tweaked (the accounts.conf file is loaded into the ConfigMap), which results in the config file above. We are using NKeys to authenticate.

nats:
  image: nats:2.1.7-alpine3.11
  pullPolicy: IfNotPresent

  externalAccess: false
  advertise: false

  serviceAccount: "nats-server"

  connectRetries: 30

  pingInterval:

  # Server settings.
  limits:
    maxConnections:
    maxSubscriptions:
    # maxControlLine: 512
    # maxPayload: 65536

    writeDeadline: "2s"
    maxPending:
    maxPings:
    lameDuckDuration:

  logging:
    debug: false
    trace: false
    logtime:
    connectErrorReports:
    reconnectErrorReports:

  tls:
    secret:
      name: nats-client-tls
    # ca: "ca.crt"
    cert: "tls.crt"
    key: "tls.key"

nameOverride: ""
imagePullSecrets: []


securityContext: null

affinity: {}

podAnnotations: {}

cluster:
  enabled: true
  replicas: 3
  noAdvertise: true


leafnodes:
  enabled: false


gateway:
  enabled: false
  name: 'default'

bootconfig:
  image: connecteverything/nats-boot-config:0.5.2
  pullPolicy: IfNotPresent


reloader:
  enabled: true
  image: connecteverything/nats-server-config-reloader:0.6.0
  pullPolicy: IfNotPresent

# Authentication setup
auth:
  enabled: true

  resolver:
    type: memory
    accountsFileName: accounts.conf

Simple subscriber using the nats.py client

import asyncio
import ssl
from nats.aio.client import Client as NATS

SUBJECT = "test"
SEED = "/path/to/nkey/seed"
CLUSTER_ADDRESS = "nats.cluster.address:4222"

async def run(loop):

    async def message_handler(msg):
        # Print every message received on the subscribed subject.
        subject = msg.subject
        reply = msg.reply
        data = msg.data.decode()
        print("Received a message on '{subject} {reply}': {data}".format(
            subject=subject, reply=reply, data=data))

    async def error_cb(e):
        # Called by the client on asynchronous errors (e.g. ErrStaleConnection).
        print("Error:", e)

    nc = NATS()
    # TLS context using the system's default CA certificates.
    context = ssl.create_default_context()
    await nc.connect(CLUSTER_ADDRESS, tls=context, nkeys_seed=SEED, error_cb=error_cb)

    await nc.subscribe(SUBJECT, cb=message_handler)

if __name__ == '__main__':
    loop = asyncio.new_event_loop()
    loop.create_task(run(loop))
    # Keep the loop running so the subscription stays active.
    loop.run_forever()

The problem

Output when sending a message on the subject:

Received a message on 'test ': test-message

Then during the restart we get this error:

Error: <class 'nats.aio.errors.ErrStaleConnection'>

After the restart, when I publish something on the same subject, the subscriber no longer receives any messages.

The server restart completes in under two minutes (which is the default reconnect window of the NATS client), but I still need to restart all our services to get them to reconnect to the NATS server.
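For reference, the two-minute figure matches the nats.py connection defaults as I understand them (60 reconnect attempts, 2 seconds apart). Below is a minimal sketch that makes the retry window explicit and logs connection state changes. It assumes the `max_reconnect_attempts`, `reconnect_time_wait`, `disconnected_cb`, `reconnected_cb` and `closed_cb` keyword arguments of `connect()`; the helper names `reconnect_window` and `build_connect_options` are hypothetical and not part of the library. It does not work around the bug reported here; it only makes it easier to see whether the client is still retrying or has given up.

```python
# Assumed nats.py defaults; verify against the installed client version.
DEFAULT_MAX_RECONNECT_ATTEMPTS = 60
DEFAULT_RECONNECT_TIME_WAIT = 2  # seconds between attempts


def reconnect_window(max_attempts=DEFAULT_MAX_RECONNECT_ATTEMPTS,
                     time_wait=DEFAULT_RECONNECT_TIME_WAIT):
    """Approximate retry window in seconds (attempts times wait)."""
    return max_attempts * time_wait


async def on_disconnected():
    print("Disconnected from NATS")


async def on_reconnected():
    print("Reconnected to NATS")


async def on_closed():
    print("Connection closed; no further reconnect attempts")


def build_connect_options(restart_seconds,
                          time_wait=DEFAULT_RECONNECT_TIME_WAIT):
    """Hypothetical helper: connect() kwargs whose window covers a restart."""
    # Ceiling division, never going below the assumed default attempt count.
    needed = -(-restart_seconds // time_wait)
    attempts = max(DEFAULT_MAX_RECONNECT_ATTEMPTS, needed)
    return {
        "max_reconnect_attempts": attempts,
        "reconnect_time_wait": time_wait,
        "disconnected_cb": on_disconnected,
        "reconnected_cb": on_reconnected,
        "closed_cb": on_closed,
    }
```

In the subscriber above, the options could be passed as `await nc.connect(CLUSTER_ADDRESS, tls=context, nkeys_seed=SEED, error_cb=error_cb, **build_connect_options(300))`, where 300 is an assumed upper bound on the restart time in seconds.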

Thanks in advance for the help!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
wallyqs commented, Nov 12, 2020

Apologies for the delay. I found the issue, and a client release with the hotfix is now available here: https://github.com/nats-io/nats.py/releases/tag/v0.11.4
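To confirm that a service is running a client with this fix, a version check along the following lines may help. This is a sketch, not part of nats.py: `parse_version`, `has_reconnect_fix` and `installed_client_version` are hypothetical helpers, and the distribution names tried (`asyncio-nats-client` and `nats-py`) are assumptions about how the client was published, so adjust them to whatever `pip list` shows.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.11.4' or 'v0.11.4' into a tuple of ints for comparison.

    Only plain numeric versions are handled; pre-release suffixes such
    as 'rc1' would raise ValueError.
    """
    return tuple(int(part) for part in text.lstrip("v").split("."))


# Release named in the maintainer's comment.
FIXED_IN = parse_version("v0.11.4")


def has_reconnect_fix(installed):
    """True if the given version string is at or above the hotfix release."""
    return parse_version(installed) >= FIXED_IN


def installed_client_version(candidates=("asyncio-nats-client", "nats-py")):
    """Return the installed client version, or None if none is found.

    The candidate distribution names are assumptions.
    """
    for name in candidates:
        try:
            return metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None
```

A service could call `installed_client_version()` at startup and log a warning when the result is not `None` and `has_reconnect_fix(...)` returns `False`.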

0 reactions
SimonVHB commented, Oct 28, 2020

@wallyqs I just tested it: the reconnect does work from within the cluster! Almost all of our services run in a different K8s cluster, in a custom container based on python:3.8-alpine3.11.