Manual failover failing in lab setup

Hi, I am trying to set up a reference cluster of 3 nodes running etcd, 2 of which also run the database. I managed to get the cluster up and running, and automatic failover works fine (at first glance), but manual failover fails. After adding some debug output (see below), it seems that Patroni thinks the passive node is already active, therefore gets a wrong xlog_location and refuses to fail over because the maximum replication lag is exceeded. In reality this is not the case: no write operations are in progress in the lab setup during failover.

Mar 28 17:14:46 centos7-test-andres01 patroni: 2017-03-28 17:14:46,791 INFO: received failover request with leader=db01 candidate=None scheduled_at=None
Mar 28 17:14:46 centos7-test-andres01 patroni: 2017-03-28 17:14:46,797 INFO: Got response from db02 http://127.0.0.1:8008/patroni: {"database_system_identifier": "6402202967751248833", "postmaster_start_time": "2017-03-28 17:13:56.572 CEST", "xlog": {"location": 51059728}, "patroni": {"scope": "mycluster", "version": "1.2.3"}, "replication": [{"sync_state": "async", "sync_priority": 0, "client_addr": "172.27.167.62", "state": "streaming", "application_name": "db02", "usename": "replicator"}], "state": "running", "role": "master", "server_version": 90506}
Mar 28 17:14:46 centos7-test-andres01 patroni: 2017-03-28 17:14:46,901 INFO: Lock owner: db01; I am db01
Mar 28 17:14:46 centos7-test-andres01 patroni: 2017-03-28 17:14:46,908 INFO: Got response from db02 http://127.0.0.1:8008/patroni: {"database_system_identifier": "6402202967751248833", "postmaster_start_time": "2017-03-28 17:13:56.572 CEST", "xlog": {"location": 51059728}, "patroni": {"scope": "mycluster", "version": "1.2.3"}, "replication": [{"sync_state": "async", "sync_priority": 0, "client_addr": "172.27.167.62", "state": "streaming", "application_name": "db02", "usename": "replicator"}], "state": "running", "role": "master", "server_version": 90506}
Mar 28 17:14:47 centos7-test-andres01 patroni: 2017-03-28 17:14:47,004 INFO: last_leader_operation = 51059728, lag = 51059728, xlog_location = False
Mar 28 17:14:47 centos7-test-andres01 patroni: 2017-03-28 17:14:47,005 INFO: maximum_lag_on_failover = 1024
Mar 28 17:14:47 centos7-test-andres01 patroni: 2017-03-28 17:14:47,005 INFO: Member db02 exceeds maximum replication lag
Mar 28 17:14:47 centos7-test-andres01 patroni: 2017-03-28 17:14:47,005 WARNING: manual failover: no healthy members found, failover is not possible
Mar 28 17:14:47 centos7-test-andres01 patroni: 2017-03-28 17:14:47,005 INFO: Cleaning up failover key

I am unsure if I am looking in the right direction or maybe just have some config issue I overlooked and would be happy for any advice! Thanks in advance!

Andres

Top GitHub Comments

CyberDem0n commented, Apr 7, 2017

Here is your problem:

restapi:
  listen: 0.0.0.0:8008
  connect_address: 127.0.0.1:8008

restapi:
  listen: 127.0.0.1:8008
  connect_address: 127.0.0.1:8008

The restapi configuration is wrong. Patroni uses the REST API for communication between nodes. When a node tries to access 127.0.0.1:8008, it connects to itself instead of the other node.

Just change it to:

restapi: # db01
  listen: 172.27.167.61:8008
  connect_address: 172.27.167.61:8008

restapi: # db02
  listen: 172.27.167.62:8008
  connect_address: 172.27.167.62:8008

and it will work.
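
As a rough illustration of the rule behind this fix, the sketch below flags a connect_address that other nodes cannot use because it points at a loopback or wildcard address. The helper name peer_reachable is hypothetical and not part of Patroni; the check uses only the Python standard library, does not resolve hostnames, and does not handle bracketed IPv6 addresses.

```python
import ipaddress


def peer_reachable(connect_address: str) -> bool:
    """Return False if the host part is a loopback or unspecified address.

    Hypothetical helper for illustration only; Patroni itself does not
    ship this function.
    """
    # Split "host:port" on the last colon.
    host, _, _port = connect_address.rpartition(":")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Hostnames are not resolved in this sketch,
        # so only the obvious "localhost" (or an empty host) is rejected.
        return host not in ("", "localhost")
    return not (ip.is_loopback or ip.is_unspecified)


for addr in ("127.0.0.1:8008", "0.0.0.0:8008", "172.27.167.61:8008"):
    print(addr, "usable by peers:", peer_reachable(addr))
```

A wildcard such as 0.0.0.0 is only a problem for connect_address; it can still be a valid value for listen.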

CyberDem0n commented, Apr 7, 2017

restapi: # db01
  listen: 172.27.167.61:8008 # actually it could be 0.0.0.0:8008 in your case
  connect_address: 172.27.167.61:8008 # this setting tells how current node can be accessible by other nodes

"Patroni also checks the config from db02 (via etcd)"

It doesn't check the config. It tries to access node db02 via the REST API to make sure it is healthy. The REST API endpoint address is written into etcd by node db02 itself.
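
To see how this produces the numbers in the log above, here is a minimal sketch of the lag arithmetic, using only the values from the log. It is a simplification under stated assumptions, not Patroni's actual code: the function name exceeds_max_lag is hypothetical, and the replica field name replayed_location is an assumption. Because db02 published 127.0.0.1:8008, the leader db01 ends up querying itself, gets back role "master", and so has no replica position to compare against.

```python
def exceeds_max_lag(last_leader_operation, member_status, maximum_lag_on_failover):
    """Return (lag, too_far) for a member's REST API status.

    Hypothetical simplification: a member reporting role "master" gets
    no replica position (False), matching "xlog_location = False" in the log.
    """
    is_master = member_status["role"] == "master"
    # Assumed field name for a replica's replay position.
    xlog_location = (not is_master) and member_status["xlog"].get("replayed_location", 0)
    # False counts as 0 in arithmetic, so the lag equals the leader position.
    lag = last_leader_operation - xlog_location
    return lag, lag > maximum_lag_on_failover


# What db01 received when it reached itself through 127.0.0.1:8008.
self_response = {"role": "master", "xlog": {"location": 51059728}}
print(exceeds_max_lag(51059728, self_response, 1024))

# What a caught-up replica might report (illustrative values).
replica_response = {"role": "replica", "xlog": {"replayed_location": 51059728}}
print(exceeds_max_lag(51059728, replica_response, 1024))
```

With the self-response, the computed lag equals the full leader position (51059728), which is far above maximum_lag_on_failover = 1024, so the candidate is rejected even though no writes are happening.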
