Consul: lock is lost on 7-10 s network issues


I have Patroni and a Consul server on the same machine, db1 (192.168.241.17), and the same setup on the slave server, db2 (192.168.241.18). db1 holds the lock. If db1 has a short network issue, Consul removes the node from the member list after 7 seconds, deletes all of its locks, and a failover happens.

How can I improve the situation and increase the timeout to, say, 30-60 s?

I asked on the Consul mailing list, and they suggested associating the session with a TTL-based check instead of binding it to “serfHealth”, so that the lock is not lost when node db1 is removed from the member list.
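For illustration, this is roughly what that suggestion looks like against the raw Consul HTTP API (a minimal sketch only; the key name, value and TTL are made up, and Patroni manages the session itself once it is configured as described in the comments below):

# Create a session tied only to a TTL. "Checks": [] drops the default
# serfHealth check, so the session survives the node falling out of the
# serf member list.
curl -s -X PUT http://localhost:8500/v1/session/create \
  -d '{"Name": "db1-leader", "TTL": "60s", "Checks": []}'
# -> {"ID": "<session-uuid>"}

# Acquire the leader key with that session; the lock is now lost only when
# the TTL expires without being renewed.
curl -s -X PUT "http://localhost:8500/v1/kv/service/mycluster/leader?acquire=<session-uuid>" -d 'db1'

# The lock holder has to renew the session periodically, before the TTL runs out.
curl -s -X PUT http://localhost:8500/v1/session/renew/<session-uuid>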

Consul on db2:

2017/09/13 17:29:15 [ERR] memberlist: Failed fallback ping: read tcp 192.168.241.18:34928->192.168.241.17:8301: i/o timeout
2017/09/13 17:29:15 [INFO] memberlist: Suspect db1.aes128.com has failed, no acks received
2017/09/13 17:29:18 [INFO] memberlist: Marking db1.aes128.com as failed, suspect timeout reached (2 peer confirmations)
2017/09/13 17:29:18 [INFO] serf: EventMemberFailed: db1.aes128.com 192.168.241.17
2017/09/13 17:29:18 [INFO] consul: Removing LAN server db1.aes128.com (Addr: tcp/192.168.241.17:8300) (DC: dc1)
2017/09/13 17:29:19 [ERR] memberlist: Failed fallback ping: read tcp 192.168.241.18:34941->192.168.241.17:8301: i/o timeout
2017/09/13 17:29:19 [INFO] memberlist: Suspect db1.aes128.com has failed, no acks received
2017/09/13 17:29:24 [INFO] serf: EventMemberLeave (forced): db1.aes128.com 192.168.241.17
2017/09/13 17:29:24 [INFO] consul: Removing LAN server db1.aes128.com (Addr: tcp/192.168.241.17:8300) (DC: dc1)

Patroni on db2:

2017-09-13 17:29:14,488 INFO: Lock owner: db1; I am db2
2017-09-13 17:29:14,488 INFO: does not have lock
2017-09-13 17:29:14,504 INFO: no action.  i am a secondary and i am following a leader
2017-09-13 17:29:20,761 WARNING: request failed: GET http://192.168.241.17:8008/patroni (HTTPConnectionPool(host='192.168.241.17', port=8008): Read timed out. (read timeout=2))
server signaled
server promoting
2017-09-13 17:29:20,979 INFO: cleared rewind state after becoming the leader
2017-09-13 17:29:21,006 INFO: promoted self to leader by acquiring session lock
2017-09-13 17:29:28,732 INFO: Lock owner: db2; I am db2
2017-09-13 17:29:28,803 INFO: no action.  i am the leader with the lock

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 6

Top GitHub Comments

1 reaction
CyberDem0n commented, Sep 15, 2017

You definitely have some network problems, but Patroni is doing the best it can. I don’t know very much about Consul configuration and operations, but in Patroni you can try to bump the ttl and retry_timeout values.

For example, you can set ttl=60 and retry_timeout=25. The general rule is: loop_wait + 2*retry_timeout <= ttl. loop_wait is 10 seconds by default, so with these values 10 + 2*25 = 60, which just satisfies the rule.

You can change these parameters with patronictl:

$ patronictl edit-config -s 'ttl=60' -s 'retry_timeout=25' your_cluster_name

In this case the leader key will expire only after 60 seconds, and Patroni will keep retrying Consul requests for up to 25 seconds. That means that if the network is restored within 25 seconds, the master will not be demoted.
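To double-check the values afterwards, patronictl show-config prints the current dynamic configuration (a sketch; the exact output depends on your other dynamic settings):

$ patronictl show-config your_cluster_name
loop_wait: 10
retry_timeout: 25
ttl: 60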

0 reactions
Vanav commented, Dec 18, 2017

I’ve done some tests; to keep the leader lock until the TTL expires, the following requirements should be met:

  1. The new checks setting for Consul in patroni.yml should be empty:
consul:
  host: localhost:8500
  checks: []
  2. There should be no Consul server on the Patroni nodes. The preferred configuration is to run a Consul client agent on each Patroni node (a sketch of a client-mode agent follows below, after these notes); this client will own the leader lock without serfHealth, and the lock will survive even if the Consul client stays offline for a long time.

An alternative configuration (not tested) is to connect to a Consul agent on another node (not localhost). But Consul prefers to run a client on every node.

Incorrect configuration: if you run one of the Consul servers on the same node as Patroni and Postgres, the lock will be lost about 5 seconds after that node goes offline.
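For reference, a client-mode Consul agent on a Patroni node can be started roughly like this (a sketch; the data directory and the addresses of the separate Consul server nodes are placeholders for your environment):

# No -server flag, so the agent runs as a client: it joins the cluster and
# forwards requests to the Consul servers running on other machines.
consul agent -data-dir=/var/lib/consul -bind=192.168.241.17 \
  -retry-join=192.168.241.20 -retry-join=192.168.241.21 -retry-join=192.168.241.22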

Note: Consul v1.0.0 and v1.0.1 had a bug preventing the creation of a session without serfHealth (https://github.com/hashicorp/consul/issues/3732).
