Consul: lock is lost on 7-10 s network issues
I have Patroni and a Consul server on the same host db1 (192.168.241.17), and the same setup on the slave host db2 (192.168.241.18). db1 holds the leader lock. If db1 has a short network issue, Consul removes the node from the member list after about 7 seconds, invalidates its sessions (which deletes all the locks), and a failover happens.
How can I improve the situation and increase the timeout to, say, 30-60 seconds?
I asked a question on the Consul mailing list, and they suggested associating the session with a TTL-based check instead of binding it to "serfHealth". That way the lock will not be lost if node db1 is removed from the member list.
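For reference, creating such a session directly against the Consul HTTP API looks roughly like this (a sketch only; the session name, TTL value, and agent address are examples, and Patroni manages the session itself when configured accordingly):

    # Create a session with a TTL and an empty check list, so it is not tied to serfHealth
    curl -X PUT -d '{"Name": "patroni-db1", "TTL": "30s", "Checks": []}' \
         http://127.0.0.1:8500/v1/session/create

    # The session (and any lock acquired with it) then survives node-health flaps,
    # as long as it is renewed before the TTL expires:
    curl -X PUT http://127.0.0.1:8500/v1/session/renew/<session-id>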
Consul on db2:
2017/09/13 17:29:15 [ERR] memberlist: Failed fallback ping: read tcp 192.168.241.18:34928->192.168.241.17:8301: i/o timeout
2017/09/13 17:29:15 [INFO] memberlist: Suspect db1.aes128.com has failed, no acks received
2017/09/13 17:29:18 [INFO] memberlist: Marking db1.aes128.com as failed, suspect timeout reached (2 peer confirmations)
2017/09/13 17:29:18 [INFO] serf: EventMemberFailed: db1.aes128.com 192.168.241.17
2017/09/13 17:29:18 [INFO] consul: Removing LAN server db1.aes128.com (Addr: tcp/192.168.241.17:8300) (DC: dc1)
2017/09/13 17:29:19 [ERR] memberlist: Failed fallback ping: read tcp 192.168.241.18:34941->192.168.241.17:8301: i/o timeout
2017/09/13 17:29:19 [INFO] memberlist: Suspect db1.aes128.com has failed, no acks received
2017/09/13 17:29:24 [INFO] serf: EventMemberLeave (forced): db1.aes128.com 192.168.241.17
2017/09/13 17:29:24 [INFO] consul: Removing LAN server db1.aes128.com (Addr: tcp/192.168.241.17:8300) (DC: dc1)
Patroni on db2:
2017-09-13 17:29:14,488 INFO: Lock owner: db1; I am db2
2017-09-13 17:29:14,488 INFO: does not have lock
2017-09-13 17:29:14,504 INFO: no action. i am a secondary and i am following a leader
2017-09-13 17:29:20,761 WARNING: request failed: GET http://192.168.241.17:8008/patroni (HTTPConnectionPool(host='192.168.241.17', port=8008): Read timed out. (read timeout=2))
server signaled
server promoting
2017-09-13 17:29:20,979 INFO: cleared rewind state after becoming the leader
2017-09-13 17:29:21,006 INFO: promoted self to leader by acquiring session lock
2017-09-13 17:29:28,732 INFO: Lock owner: db2; I am db2
2017-09-13 17:29:28,803 INFO: no action. i am the leader with the lock
Top GitHub Comments
You definitely have some network problems, but Patroni is doing the best it can. I don't know very much about Consul configuration and operations, but in Patroni you can try to bump the ttl and retry_timeout values. For example, you can set ttl=60 and retry_timeout=25. The general rule is: loop_wait + 2*retry_timeout <= ttl. loop_wait is 10 seconds by default.
You can change these parameters with patronictl:
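A minimal sketch of doing that with patronictl (the config file path here is an assumption; adjust it to your setup):

    patronictl -c /etc/patroni/patroni.yml edit-config -s ttl=60 -s retry_timeout=25

edit-config writes the change into the DCS, so it is picked up by all cluster members.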
In this case the leader key will expire only after 60 seconds, and Patroni will retry Consul requests for 25 seconds. That means that if the network is restored within 25 seconds, the master will not be demoted.
I've done some tests; to keep the leader lock until the TTL expires, the following requirement should be met: the checks list for Consul in patroni.yml should be empty (see the sketch after this paragraph). An alternative configuration (not tested) is to connect to Consul on another node (not localhost), but Consul prefers to run a client agent on every node.
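A sketch of the relevant patroni.yml fragment under that assumption (host and port are examples):

    consul:
      host: 127.0.0.1:8500
      checks: []   # do not tie the Consul session to serfHealth or any other check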
Incorrect configuration: if you run one of the Consul servers on the same node as Patroni and Postgres, the lock will be lost about 5 seconds after that node goes offline.
Notice: Consul v1.0.0 and v1.0.1 had a bug that prevented creating a session without serfHealth (https://github.com/hashicorp/consul/issues/3732).