One node is unable to join cluster

Hello Team,

We are running a 2-node Patroni cluster that was working fine. Yesterday we had an issue with the etcd node, so we created a new etcd node and pointed the Patroni nodes at it.

Now, when we start Patroni on the first node, it starts successfully and joins the cluster as the master. But when we start the Patroni service on the second node, the service status shows running, yet we get the following errors in the Patroni logs:

2019-11-06 13:14:40,674 INFO: Error communicating with PostgreSQL. Will try again later
2019-11-06 13:14:49,870 INFO: Lock owner: sql02; I am sql01
2019-11-06 13:14:49,871 INFO: Still starting up as a standby.
2019-11-06 13:14:49,873 INFO: Lock owner: sql02; I am sql01
2019-11-06 13:14:49,873 INFO: does not have lock
2019-11-06 13:14:49,873 INFO: establishing a new patroni connection to the postgres cluster
2019-11-06 13:14:50,724 INFO: establishing a new patroni connection to the postgres cluster
2019-11-06 13:14:50,741 WARNING: Retry got exception: 'connection problems'
2019-11-06 13:14:50,764 INFO: Error communicating with PostgreSQL. Will try again later
2019-11-06 13:14:59,861 INFO: Lock owner: sql02; I am sql01
2019-11-06 13:14:59,861 INFO: Still starting up as a standby.
2019-11-06 13:14:59,863 INFO: Lock owner: sql02; I am sql01
2019-11-06 13:14:59,863 INFO: does not have lock
2019-11-06 13:14:59,863 INFO: establishing a new patroni connection to the postgres cluster
2019-11-06 13:15:00,213 INFO: establishing a new patroni connection to the postgres cluster
2019-11-06 13:15:00,224 WARNING: Retry got exception: 'connection problems'
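
To make the repeating pattern in the log above easier to read, here is a minimal sketch that extracts which node holds the leader lock and which node wrote the log. The regular expression is based only on the "Lock owner: ...; I am ..." lines shown above; the function name `lock_owners` is made up for illustration and is not part of Patroni.

```python
import re

# Matches lines such as "INFO: Lock owner: sql02; I am sql01".
# The pattern is inferred from the log excerpt, not from Patroni's source.
LOCK_RE = re.compile(r"Lock owner: (?P<owner>[^;\s]+); I am (?P<me>\S+)")


def lock_owners(log_text):
    """Return the set of (lock_owner, local_node) pairs seen in the log."""
    return {(m["owner"], m["me"]) for m in LOCK_RE.finditer(log_text)}


if __name__ == "__main__":
    sample = (
        "2019-11-06 13:14:49,870 INFO: Lock owner: sql02; I am sql01\n"
        "2019-11-06 13:14:49,871 INFO: Still starting up as a standby.\n"
        "2019-11-06 13:14:59,861 INFO: Lock owner: sql02; I am sql01\n"
    )
    for owner, me in sorted(lock_owners(sample)):
        print(f"leader lock held by {owner}, log written by {me}")
```

A single pair in the result means the lock never changed hands during the excerpt: sql02 stays leader while sql01 keeps trying to start as a standby.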

In syslog we see the following:

Nov  6 13:15:39 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov  6 13:15:49 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov  6 13:15:59 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov  6 13:16:09 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov  6 13:16:19 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov  6 13:16:29 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections

The IP 10.133.15.9 belongs to the node where we are facing the issue. The cluster status shows the following:

# patronictl -c /etc/patroni.yml list
+---------+--------+--------------+--------+----------+----+-----------+-----------------+
| Cluster | Member |     Host     |  Role  |  State   | TL | Lag in MB | Pending restart |
+---------+--------+--------------+--------+----------+----+-----------+-----------------+
| promobi | sql01  | 10.133.15.9  |        | starting |    |   unknown |        *        |
| promobi | sql02  | 10.133.17.76 | Leader | running  | 27 |         0 |        *        |
+---------+--------+--------------+--------+----------+----+-----------+-----------------+
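
For anyone who wants to check this state from a script, below is a rough sketch that parses the plain-text table above into dictionaries and lists members whose State is not "running". It only assumes the table layout shown here; the function names are hypothetical helpers, not part of patronictl.

```python
def parse_patronictl_table(text):
    """Parse the ASCII table printed by `patronictl list` into a list of dicts.

    Assumes the layout shown in this post: rows start with '|' and the
    first such row is the header.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the command line and the +----+ borders
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def members_not_running(rows):
    """Return the names of members whose State column is not 'running'."""
    return [row["Member"] for row in rows if row.get("State") != "running"]
```

Applied to the output above, this would report sql01 as the only member that is not running.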

The status of the patroni service shows the following:

● patroni.service - Runners to orchestrate a high-availability PostgreSQL
   Loaded: loaded (/etc/systemd/system/patroni.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2019-11-06 13:02:00 UTC; 16min ago
 Main PID: 32659 (patroni)
    Tasks: 7
   Memory: 250.6M
      CPU: 16.685s
   CGroup: /system.slice/patroni.service
           ├─32659 /usr/bin/python3 /usr/local/bin/patroni /etc/patroni.yml
           ├─32676 /usr/lib/postgresql/10/bin/postgres -D /var/lib/postgresql/10/main --config-file=/etc/postgresql/10/main/postgresql.conf --hot_standby=on --wal_leve
           └─32678 postgres: promobi: startup process   recovering 0000001B00002A45000000E4                                                                            

Nov 06 13:18:40 sql01 postgres[2953]: [5-1] 2019-11-06 13:18:40.170 UTC [2953] postgres@postgres FATAL:  the database system is starting up
Nov 06 13:18:40 sql01 postgres[2954]: [4-1] 2019-11-06 13:18:40.736 UTC [2954] [unknown]@[unknown] LOG:  connection received: host=10.133.15.9 port=18301
Nov 06 13:18:40 sql01 postgres[2954]: [5-1] 2019-11-06 13:18:40.741 UTC [2954] postgres@postgres FATAL:  the database system is starting up
Nov 06 13:18:40 sql01 postgres[2955]: [4-1] 2019-11-06 13:18:40.742 UTC [2955] [unknown]@[unknown] LOG:  connection received: host=10.133.15.9 port=18303
Nov 06 13:18:40 sql01 postgres[2955]: [5-1] 2019-11-06 13:18:40.743 UTC [2955] postgres@postgres FATAL:  the database system is starting up
Nov 06 13:18:42 sql01 postgres[32678]: [206-1] 2019-11-06 13:18:42.790 UTC [32678] LOG:  contrecord is requested by 2A45/E4000028
Nov 06 13:18:43 sql01 postgres[2964]: [4-1] 2019-11-06 13:18:43.176 UTC [2964] [unknown]@[unknown] LOG:  connection received: host=10.133.15.9 port=18305
Nov 06 13:18:43 sql01 postgres[2964]: [5-1] 2019-11-06 13:18:43.182 UTC [2964] postgres@postgres FATAL:  the database system is starting up
Nov 06 13:18:43 sql01 postgres[2965]: [4-1] 2019-11-06 13:18:43.184 UTC [2965] [unknown]@[unknown] LOG:  connection received: host=10.133.15.9 port=18307
Nov 06 13:18:43 sql01 postgres[2965]: [5-1] 2019-11-06 13:18:43.185 UTC [2965] postgres@postgres FATAL:  the database system is starting up
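
The startup process line above mentions WAL segment 0000001B00002A45000000E4, and the log mentions position 2A45/E4000028. As a sketch of how the two relate, the snippet below decodes a WAL segment file name into its timeline and LSN range. It assumes the default 16 MB WAL segment size, which this post does not confirm; the helper names are made up for illustration.

```python
DEFAULT_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default wal_segment_size


def parse_wal_segment(name, segment_size=DEFAULT_SEGMENT_SIZE):
    """Split a 24-hex-digit WAL file name into (timeline, first_lsn, last_lsn)."""
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL segment name")
    timeline = int(name[0:8], 16)    # first 8 hex digits: timeline ID
    log_id = int(name[8:16], 16)     # next 8: high 32 bits of the LSN
    segment = int(name[16:24], 16)   # last 8: segment number within log_id
    first = (log_id << 32) + segment * segment_size
    return timeline, first, first + segment_size - 1


def parse_lsn(lsn):
    """Convert an LSN written as 'HI/LO' (hex) into an integer byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def format_lsn(value):
    """Format an integer byte position back into 'HI/LO' hex notation."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


if __name__ == "__main__":
    timeline, first, last = parse_wal_segment("0000001B00002A45000000E4")
    print("timeline:", timeline)
    print("range:", format_lsn(first), "to", format_lsn(last))
    print("contains 2A45/E4000028:", first <= parse_lsn("2A45/E4000028") <= last)
```

Under that assumption, the segment sql01 is recovering is on timeline 27, the same timeline the leader sql02 reports in the TL column, and the "contrecord is requested by 2A45/E4000028" position falls inside that same segment.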

Can you please help me find the root cause here? Any help will be appreciated.

Thanks.