Manual failover failing in lab setup
Hi, I am trying to set up a reference cluster consisting of 3 nodes with etcd, 2 of which also run the DB. While I did manage to get the cluster itself up and running, with automatic failover working fine (at first glance), I have issues performing a manual failover. After adding a bit of debug output (see below), it seems that Patroni thinks the passive node is already active, therefore gets a wrong xlog_location and refuses to fail over because the maximum replication lag is exceeded. In reality this is not the case, and there is absolutely no write operation in progress in the lab setup during failover.
Mar 28 17:14:46 centos7-test-andres01 patroni: 2017-03-28 17:14:46,791 INFO: received failover request with leader=db01 candidate=None scheduled_at=None
Mar 28 17:14:46 centos7-test-andres01 patroni: 2017-03-28 17:14:46,797 INFO: Got response from db02 http://127.0.0.1:8008/patroni: {"database_system_identifier": "6402202967751248833", "postmaster_start_time": "2017-03-28 17:13:56.572 CEST", "xlog": {"location": 51059728}, "patroni": {"scope": "mycluster", "version": "1.2.3"}, "replication": [{"sync_state": "async", "sync_priority": 0, "client_addr": "172.27.167.62", "state": "streaming", "application_name": "db02", "usename": "replicator"}], "state": "running", "role": "master", "server_version": 90506}
Mar 28 17:14:46 centos7-test-andres01 patroni: 2017-03-28 17:14:46,901 INFO: Lock owner: db01; I am db01
Mar 28 17:14:46 centos7-test-andres01 patroni: 2017-03-28 17:14:46,908 INFO: Got response from db02 http://127.0.0.1:8008/patroni: {"database_system_identifier": "6402202967751248833", "postmaster_start_time": "2017-03-28 17:13:56.572 CEST", "xlog": {"location": 51059728}, "patroni": {"scope": "mycluster", "version": "1.2.3"}, "replication": [{"sync_state": "async", "sync_priority": 0, "client_addr": "172.27.167.62", "state": "streaming", "application_name": "db02", "usename": "replicator"}], "state": "running", "role": "master", "server_version": 90506}
Mar 28 17:14:47 centos7-test-andres01 patroni: 2017-03-28 17:14:47,004 INFO: last_leader_operation = 51059728, lag = 51059728, xlog_location = False
Mar 28 17:14:47 centos7-test-andres01 patroni: 2017-03-28 17:14:47,005 INFO: maximum_lag_on_failover = 1024
Mar 28 17:14:47 centos7-test-andres01 patroni: 2017-03-28 17:14:47,005 INFO: Member db02 exceeds maximum replication lag
Mar 28 17:14:47 centos7-test-andres01 patroni: 2017-03-28 17:14:47,005 WARNING: manual failover: no healthy members found, failover is not possible
Mar 28 17:14:47 centos7-test-andres01 patroni: 2017-03-28 17:14:47,005 INFO: Cleaning up failover key
I am unsure whether I am looking in the right direction or just have some config issue I overlooked, and would be grateful for any advice. Thanks in advance!
Andres
Top GitHub Comments
Here is your problem:
The restapi configuration is wrong. Patroni uses the REST API for communication between nodes. When it tries to access 127.0.0.1:8008, it connects to itself instead of the other node.
Just change it to:
and it will work.
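The rejection seen in the log follows from simple arithmetic once the node has queried itself. The sketch below is only an illustration and not Patroni's actual code: the function name exceeds_max_lag is hypothetical, and treating xlog_location = False as 0 is an assumption that happens to match the logged lag = 51059728. The numbers are copied from the debug output in the question.

```python
def exceeds_max_lag(last_leader_operation, xlog_location, maximum_lag_on_failover):
    """Return (lag, too_far_behind) for a failover candidate.

    Hypothetical helper for illustration; not Patroni's real implementation.
    """
    # The response that reported role "master" came with xlog_location = False
    # in the debug output. Counting that as position 0 is an assumption.
    lag = last_leader_operation - (xlog_location or 0)
    return lag, lag > maximum_lag_on_failover


# Values from the log: the candidate appears to be a full 51059728 bytes behind.
print(exceeds_max_lag(51059728, False, 1024))

# A replica that had reported the same position as the leader would pass.
print(exceeds_max_lag(51059728, 51059728, 1024))
```

So the member is not really lagging; the check is simply comparing the leader's position against an answer that did not come from a replica.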
It doesn’t check the config. It tries to access node db02 via the REST API to make sure it is healthy. The REST API endpoint address is written into etcd by node db02.
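A quick way to spot this kind of misconfiguration is to check whether the endpoint address a member has published points at a loopback address. The following is a minimal sketch using only the Python standard library; the helper name is_loopback_api_url is hypothetical, and the second URL is an assumption built from the client_addr shown in the log, not a confirmed endpoint.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_api_url(api_url):
    """Return True if the URL's host is a loopback address (hypothetical helper)."""
    host = urlparse(api_url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Host is a DNS name rather than an IP literal; this sketch does not
        # resolve names, so it cannot tell and reports False.
        return False


# URL from the log: any other node using it would end up talking to itself.
print(is_loopback_api_url("http://127.0.0.1:8008/patroni"))

# Assumed routable address (IP taken from client_addr in the log).
print(is_loopback_api_url("http://172.27.167.62:8008/patroni"))
```

If the first form shows up as a member's endpoint, other nodes cannot reach that member through it.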