One node is unable to join cluster

Hello Team,

We run a two-node Patroni cluster that had been working fine. Yesterday we had a problem with the etcd node, so we created a new etcd node and pointed Patroni at it.

Now, when we start Patroni on the first node, it starts successfully and joins the cluster as the master. But when we start the Patroni service on the second node, the service status shows as running, yet the Patroni log contains the following errors:

2019-11-06 13:14:40,674 INFO: Error communicating with PostgreSQL. Will try again later
2019-11-06 13:14:49,870 INFO: Lock owner: sql02; I am sql01
2019-11-06 13:14:49,871 INFO: Still starting up as a standby.
2019-11-06 13:14:49,873 INFO: Lock owner: sql02; I am sql01
2019-11-06 13:14:49,873 INFO: does not have lock
2019-11-06 13:14:49,873 INFO: establishing a new patroni connection to the postgres cluster
2019-11-06 13:14:50,724 INFO: establishing a new patroni connection to the postgres cluster
2019-11-06 13:14:50,741 WARNING: Retry got exception: 'connection problems'
2019-11-06 13:14:50,764 INFO: Error communicating with PostgreSQL. Will try again later
2019-11-06 13:14:59,861 INFO: Lock owner: sql02; I am sql01
2019-11-06 13:14:59,861 INFO: Still starting up as a standby.
2019-11-06 13:14:59,863 INFO: Lock owner: sql02; I am sql01
2019-11-06 13:14:59,863 INFO: does not have lock
2019-11-06 13:14:59,863 INFO: establishing a new patroni connection to the postgres cluster
2019-11-06 13:15:00,213 INFO: establishing a new patroni connection to the postgres cluster
2019-11-06 13:15:00,224 WARNING: Retry got exception: 'connection problems'

Syslog shows the following:

Nov  6 13:15:39 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov  6 13:15:49 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov  6 13:15:59 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov  6 13:16:09 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov  6 13:16:19 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov  6 13:16:29 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections

The IP 10.133.15.9 belongs to the node where we see the problem. Checking the cluster status gives:

# patronictl -c /etc/patroni.yml list
+---------+--------+--------------+--------+----------+----+-----------+-----------------+
| Cluster | Member |     Host     |  Role  |  State   | TL | Lag in MB | Pending restart |
+---------+--------+--------------+--------+----------+----+-----------+-----------------+
| promobi | sql01  | 10.133.15.9  |        | starting |    |   unknown |        *        |
| promobi | sql02  | 10.133.17.76 | Leader | running  | 27 |         0 |        *        |
+---------+--------+--------------+--------+----------+----+-----------+-----------------+
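The table above can also be checked mechanically. The following is a minimal sketch, not part of Patroni, that parses the ASCII table text printed by `patronictl list` and reports members whose state is not `running`. The helper names `parse_patronictl_table` and `members_not_running` are hypothetical, and the sketch assumes the column layout shown in this issue.

```python
# Table text copied from the issue; the layout is assumed to match `patronictl list`.
TABLE = """
+---------+--------+--------------+--------+----------+----+-----------+-----------------+
| Cluster | Member |     Host     |  Role  |  State   | TL | Lag in MB | Pending restart |
+---------+--------+--------------+--------+----------+----+-----------+-----------------+
| promobi | sql01  | 10.133.15.9  |        | starting |    |   unknown |        *        |
| promobi | sql02  | 10.133.17.76 | Leader | running  | 27 |         0 |        *        |
+---------+--------+--------------+--------+----------+----+-----------+-----------------+
"""


def parse_patronictl_table(text):
    """Turn the ASCII table into a list of dicts keyed by column header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines (+---+) and anything else
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first data line is the header row
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def members_not_running(rows):
    """Return the names of members whose State column is not 'running'."""
    return [row["Member"] for row in rows if row["State"] != "running"]


rows = parse_patronictl_table(TABLE)
print(members_not_running(rows))
```

Here `sql01` is the only member reported, which matches the `starting` state shown in the table.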

The status of the Patroni service is:

● patroni.service - Runners to orchestrate a high-availability PostgreSQL
   Loaded: loaded (/etc/systemd/system/patroni.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2019-11-06 13:02:00 UTC; 16min ago
 Main PID: 32659 (patroni)
    Tasks: 7
   Memory: 250.6M
      CPU: 16.685s
   CGroup: /system.slice/patroni.service
           ├─32659 /usr/bin/python3 /usr/local/bin/patroni /etc/patroni.yml
           ├─32676 /usr/lib/postgresql/10/bin/postgres -D /var/lib/postgresql/10/main --config-file=/etc/postgresql/10/main/postgresql.conf --hot_standby=on --wal_leve
           └─32678 postgres: promobi: startup process   recovering 0000001B00002A45000000E4                                                                            

Nov 06 13:18:40 sql01 postgres[2953]: [5-1] 2019-11-06 13:18:40.170 UTC [2953] postgres@postgres FATAL:  the database system is starting up
Nov 06 13:18:40 sql01 postgres[2954]: [4-1] 2019-11-06 13:18:40.736 UTC [2954] [unknown]@[unknown] LOG:  connection received: host=10.133.15.9 port=18301
Nov 06 13:18:40 sql01 postgres[2954]: [5-1] 2019-11-06 13:18:40.741 UTC [2954] postgres@postgres FATAL:  the database system is starting up
Nov 06 13:18:40 sql01 postgres[2955]: [4-1] 2019-11-06 13:18:40.742 UTC [2955] [unknown]@[unknown] LOG:  connection received: host=10.133.15.9 port=18303
Nov 06 13:18:40 sql01 postgres[2955]: [5-1] 2019-11-06 13:18:40.743 UTC [2955] postgres@postgres FATAL:  the database system is starting up
Nov 06 13:18:42 sql01 postgres[32678]: [206-1] 2019-11-06 13:18:42.790 UTC [32678] LOG:  contrecord is requested by 2A45/E4000028
Nov 06 13:18:43 sql01 postgres[2964]: [4-1] 2019-11-06 13:18:43.176 UTC [2964] [unknown]@[unknown] LOG:  connection received: host=10.133.15.9 port=18305
Nov 06 13:18:43 sql01 postgres[2964]: [5-1] 2019-11-06 13:18:43.182 UTC [2964] postgres@postgres FATAL:  the database system is starting up
Nov 06 13:18:43 sql01 postgres[2965]: [4-1] 2019-11-06 13:18:43.184 UTC [2965] [unknown]@[unknown] LOG:  connection received: host=10.133.15.9 port=18307
Nov 06 13:18:43 sql01 postgres[2965]: [5-1] 2019-11-06 13:18:43.185 UTC [2965] postgres@postgres FATAL:  the database system is starting up
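The startup process line and the `contrecord` message above can be cross-checked against each other. This is a minimal sketch, assuming the default 16 MB WAL segment size (the issue does not state the configured size), that decodes the WAL file name `0000001B00002A45000000E4` and maps the LSN `2A45/E4000028` to its segment. The function names are hypothetical.

```python
# Assumption: PostgreSQL's default WAL segment size of 16 MB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def decode_wal_filename(name):
    """Split a 24-hex-digit WAL file name into (timeline, log id, segment id)."""
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL file name")
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    seg_id = int(name[16:24], 16)
    return timeline, log_id, seg_id


def lsn_to_segment(lsn):
    """Map an LSN such as '2A45/E4000028' to (log id, segment id)."""
    high, low = lsn.split("/")
    return int(high, 16), int(low, 16) // WAL_SEGMENT_SIZE


timeline, log_id, seg_id = decode_wal_filename("0000001B00002A45000000E4")
print(timeline, log_id, seg_id)
print(lsn_to_segment("2A45/E4000028"))
```

Under that assumption, the file name decodes to timeline 27, the same timeline the leader reports in the `TL` column, and the LSN in the `contrecord` message falls in the same segment the startup process is recovering. This only shows where replay on sql01 currently stands; it does not by itself identify the root cause.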

Can you please help us find the root cause? Any help will be appreciated.

Thanks.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 9

Top GitHub Comments

4 reactions
tomasz-zylka-fluke commented, Jan 26, 2021

@Tekchanddagar I second @thepotatocannon's question, as we hit the same issue today.

1 reaction
adamcharnock commented, Feb 7, 2022

I encountered this issue too. A patronictl reinit on the affected node fixed it, but it was disconcerting.
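For reference, here is a hedged sketch of what that reinit call might look like for the cluster in this issue. It only builds and prints the command line instead of running it, because `patronictl reinit` discards the member's data directory and rebuilds it from the leader. The config path, cluster name and member name are taken from the output above; the helper `reinit_command` is hypothetical.

```python
import shlex


def reinit_command(config_path, cluster, member):
    """Build the argv for `patronictl reinit` on one cluster member."""
    return ["patronictl", "-c", config_path, "reinit", cluster, member]


# Values taken from the issue: /etc/patroni.yml, cluster promobi, member sql01.
cmd = reinit_command("/etc/patroni.yml", "promobi", "sql01")
print(shlex.join(cmd))
```

Printing the command first gives a chance to confirm the member name before anything destructive runs; executing it is left to the operator.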
