One node is unable to join cluster
Hello Team,
We are using a 2-node Patroni cluster, which was working fine, but yesterday we faced an issue with the etcd node, so we created another etcd node and pointed the Patroni nodes to the new etcd node.
Now when we start Patroni on the first node, it starts successfully and joins the cluster as the master node. But when we start the Patroni service on the second node, the service status shows running, yet we get the following errors in the Patroni logs:
2019-11-06 13:14:40,674 INFO: Error communicating with PostgreSQL. Will try again later
2019-11-06 13:14:49,870 INFO: Lock owner: sql02; I am sql01
2019-11-06 13:14:49,871 INFO: Still starting up as a standby.
2019-11-06 13:14:49,873 INFO: Lock owner: sql02; I am sql01
2019-11-06 13:14:49,873 INFO: does not have lock
2019-11-06 13:14:49,873 INFO: establishing a new patroni connection to the postgres cluster
2019-11-06 13:14:50,724 INFO: establishing a new patroni connection to the postgres cluster
2019-11-06 13:14:50,741 WARNING: Retry got exception: 'connection problems'
2019-11-06 13:14:50,764 INFO: Error communicating with PostgreSQL. Will try again later
2019-11-06 13:14:59,861 INFO: Lock owner: sql02; I am sql01
2019-11-06 13:14:59,861 INFO: Still starting up as a standby.
2019-11-06 13:14:59,863 INFO: Lock owner: sql02; I am sql01
2019-11-06 13:14:59,863 INFO: does not have lock
2019-11-06 13:14:59,863 INFO: establishing a new patroni connection to the postgres cluster
2019-11-06 13:15:00,213 INFO: establishing a new patroni connection to the postgres cluster
2019-11-06 13:15:00,224 WARNING: Retry got exception: 'connection problems'
In syslog we get the following logs:
Nov 6 13:15:39 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov 6 13:15:49 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov 6 13:15:59 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov 6 13:16:09 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov 6 13:16:19 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
Nov 6 13:16:29 sql01 patroni[32659]: 10.133.15.9:5432 - rejecting connections
The IP 10.133.15.9 is the node's own IP, i.e. the node where we are facing the issue. When we check the cluster status, we get the following:
# patronictl -c /etc/patroni.yml list
+---------+--------+--------------+--------+----------+----+-----------+-----------------+
| Cluster | Member | Host | Role | State | TL | Lag in MB | Pending restart |
+---------+--------+--------------+--------+----------+----+-----------+-----------------+
| promobi | sql01 | 10.133.15.9 | | starting | | unknown | * |
| promobi | sql02 | 10.133.17.76 | Leader | running | 27 | 0 | * |
+---------+--------+--------------+--------+----------+----+-----------+-----------------+
When we check the status of the Patroni service, we get the following:
● patroni.service - Runners to orchestrate a high-availability PostgreSQL
Loaded: loaded (/etc/systemd/system/patroni.service; disabled; vendor preset: enabled)
Active: active (running) since Wed 2019-11-06 13:02:00 UTC; 16min ago
Main PID: 32659 (patroni)
Tasks: 7
Memory: 250.6M
CPU: 16.685s
CGroup: /system.slice/patroni.service
├─32659 /usr/bin/python3 /usr/local/bin/patroni /etc/patroni.yml
├─32676 /usr/lib/postgresql/10/bin/postgres -D /var/lib/postgresql/10/main --config-file=/etc/postgresql/10/main/postgresql.conf --hot_standby=on --wal_leve
└─32678 postgres: promobi: startup process recovering 0000001B00002A45000000E4
Nov 06 13:18:40 sql01 postgres[2953]: [5-1] 2019-11-06 13:18:40.170 UTC [2953] postgres@postgres FATAL: the database system is starting up
Nov 06 13:18:40 sql01 postgres[2954]: [4-1] 2019-11-06 13:18:40.736 UTC [2954] [unknown]@[unknown] LOG: connection received: host=10.133.15.9 port=18301
Nov 06 13:18:40 sql01 postgres[2954]: [5-1] 2019-11-06 13:18:40.741 UTC [2954] postgres@postgres FATAL: the database system is starting up
Nov 06 13:18:40 sql01 postgres[2955]: [4-1] 2019-11-06 13:18:40.742 UTC [2955] [unknown]@[unknown] LOG: connection received: host=10.133.15.9 port=18303
Nov 06 13:18:40 sql01 postgres[2955]: [5-1] 2019-11-06 13:18:40.743 UTC [2955] postgres@postgres FATAL: the database system is starting up
Nov 06 13:18:42 sql01 postgres[32678]: [206-1] 2019-11-06 13:18:42.790 UTC [32678] LOG: contrecord is requested by 2A45/E4000028
Nov 06 13:18:43 sql01 postgres[2964]: [4-1] 2019-11-06 13:18:43.176 UTC [2964] [unknown]@[unknown] LOG: connection received: host=10.133.15.9 port=18305
Nov 06 13:18:43 sql01 postgres[2964]: [5-1] 2019-11-06 13:18:43.182 UTC [2964] postgres@postgres FATAL: the database system is starting up
Nov 06 13:18:43 sql01 postgres[2965]: [4-1] 2019-11-06 13:18:43.184 UTC [2965] [unknown]@[unknown] LOG: connection received: host=10.133.15.9 port=18307
Nov 06 13:18:43 sql01 postgres[2965]: [5-1] 2019-11-06 13:18:43.185 UTC [2965] postgres@postgres FATAL: the database system is starting up
Can you please help me find the root cause here? Any help will be appreciated.
Thanks.
@Tekchanddagar I second @thepotatocannon's question, as we have the same issue today.
I encountered this issue too. A patronictl reinit on the affected node fixed it, but it was disconcerting.
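For anyone landing here, a minimal sketch of that reinit step, assuming the cluster name (promobi), the stuck member (sql01), and the config path (/etc/patroni.yml) shown in the original report; substitute your own values. Reinitializing a member discards its local data directory and rebuilds it from the current leader, so this sketch only prints the commands for review rather than running them.

```shell
#!/bin/sh
# Assumed values, copied from the report above; adjust for your environment.
CONFIG=/etc/patroni.yml
CLUSTER=promobi
MEMBER=sql01

# Command that would reinitialize the stuck replica from the current leader.
REINIT_CMD="patronictl -c $CONFIG reinit $CLUSTER $MEMBER"

# Print the commands instead of executing them: reinit wipes the member's
# data directory, so confirm the member name against the listing first.
echo "Check cluster state first: patronictl -c $CONFIG list"
echo "Then, after review, run:   $REINIT_CMD"
```

Once the command has actually been run, the patronictl list output should eventually show the member as running rather than starting; how long that takes depends on the size of the database being copied.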