Consul: lock is lost on 7-10 s network issues


I have Patroni and a Consul server on the same machine, db1 (192.168.241.17), and the same setup on the slave server, db2 (192.168.241.18). db1 holds the lock. If db1 has a short network issue, Consul removes the node from the member list after 7 seconds, deletes all of its locks, and a failover happens.

How can I improve the situation and increase the timeout to, say, 30-60 s?

I asked on the Consul mailing list, and they suggested associating the session with a TTL-based check instead of binding it to “serfHealth”, so that the lock is not lost when node db1 is removed from the member list.
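For illustration, this is roughly what that suggestion looks like against the raw Consul HTTP API (a minimal sketch only; the key name, value and TTL are made up, and Patroni manages the session itself once it is configured as described in the comments below):

# Create a session tied only to a TTL. "Checks": [] drops the default
# serfHealth check, so the session survives the node falling out of the
# serf member list.
curl -s -X PUT http://localhost:8500/v1/session/create \
  -d '{"Name": "db1-leader", "TTL": "60s", "Checks": []}'
# -> {"ID": "<session-uuid>"}

# Acquire the leader key with that session; the lock is now lost only when
# the TTL expires without being renewed.
curl -s -X PUT "http://localhost:8500/v1/kv/service/mycluster/leader?acquire=<session-uuid>" -d 'db1'

# The lock holder has to renew the session periodically, before the TTL runs out.
curl -s -X PUT http://localhost:8500/v1/session/renew/<session-uuid>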

Consul on db2:

2017/09/13 17:29:15 [ERR] memberlist: Failed fallback ping: read tcp 192.168.241.18:34928->192.168.241.17:8301: i/o timeout
2017/09/13 17:29:15 [INFO] memberlist: Suspect db1.aes128.com has failed, no acks received
2017/09/13 17:29:18 [INFO] memberlist: Marking db1.aes128.com as failed, suspect timeout reached (2 peer confirmations)
2017/09/13 17:29:18 [INFO] serf: EventMemberFailed: db1.aes128.com 192.168.241.17
2017/09/13 17:29:18 [INFO] consul: Removing LAN server db1.aes128.com (Addr: tcp/192.168.241.17:8300) (DC: dc1)
2017/09/13 17:29:19 [ERR] memberlist: Failed fallback ping: read tcp 192.168.241.18:34941->192.168.241.17:8301: i/o timeout
2017/09/13 17:29:19 [INFO] memberlist: Suspect db1.aes128.com has failed, no acks received
2017/09/13 17:29:24 [INFO] serf: EventMemberLeave (forced): db1.aes128.com 192.168.241.17
2017/09/13 17:29:24 [INFO] consul: Removing LAN server db1.aes128.com (Addr: tcp/192.168.241.17:8300) (DC: dc1)

Patroni on db2:

2017-09-13 17:29:14,488 INFO: Lock owner: db1; I am db2
2017-09-13 17:29:14,488 INFO: does not have lock
2017-09-13 17:29:14,504 INFO: no action.  i am a secondary and i am following a leader
2017-09-13 17:29:20,761 WARNING: request failed: GET http://192.168.241.17:8008/patroni (HTTPConnectionPool(host='192.168.241.17', port=8008): Read timed out. (read timeout=2))
server signaled
server promoting
2017-09-13 17:29:20,979 INFO: cleared rewind state after becoming the leader
2017-09-13 17:29:21,006 INFO: promoted self to leader by acquiring session lock
2017-09-13 17:29:28,732 INFO: Lock owner: db2; I am db2
2017-09-13 17:29:28,803 INFO: no action.  i am the leader with the lock

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 6

Top GitHub Comments

1 reaction
CyberDem0n commented, Sep 15, 2017

You definitely have some network problems, but Patroni is doing the best it can. I don’t know very much about Consul configuration and operations, but in Patroni you can try to bump the ttl and retry_timeout values.

For example, you can set ttl=60 and retry_timeout=25. The general rule is: loop_wait + 2*retry_timeout <= ttl. loop_wait is 10 seconds by default, so with these values 10 + 2*25 = 60, which just satisfies the rule.

You can change these parameters with patronictl:

$ patronictl edit-config -s 'ttl=60' -s 'retry_timeout=25' your_cluster_name

In this case the leader key will expire only after 60 seconds, and Patroni will keep retrying Consul requests for up to 25 seconds. That means that if the network is restored within 25 seconds, the master will not be demoted.
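To double-check the values afterwards, patronictl show-config prints the current dynamic configuration (a sketch; the exact output depends on your other dynamic settings):

$ patronictl show-config your_cluster_name
loop_wait: 10
retry_timeout: 25
ttl: 60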

0 reactions
Vanav commented, Dec 18, 2017

I’ve done some tests; to keep the leader lock until the TTL expires, the following requirements should be met:

  1. The new checks setting for Consul in patroni.yml should be empty:
consul:
  host: localhost:8500
  checks: []
  2. There should be no Consul server on the Patroni nodes. The preferred configuration is to run a Consul client agent on each Patroni node (a sketch of a client-mode agent follows below, after these notes); this client will own the leader lock without serfHealth, and the lock will survive even if the Consul client stays offline for a long time.

An alternative configuration (not tested) is to connect to a Consul agent on another node (not localhost). But Consul prefers to run a client on every node.

Incorrect configuration: if you run one of the Consul servers on the same node as Patroni and Postgres, the lock will be lost about 5 seconds after that node goes offline.
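For reference, a client-mode Consul agent on a Patroni node can be started roughly like this (a sketch; the data directory and the addresses of the separate Consul server nodes are placeholders for your environment):

# No -server flag, so the agent runs as a client: it joins the cluster and
# forwards requests to the Consul servers running on other machines.
consul agent -data-dir=/var/lib/consul -bind=192.168.241.17 \
  -retry-join=192.168.241.20 -retry-join=192.168.241.21 -retry-join=192.168.241.22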

Note: Consul v1.0.0 and v1.0.1 had a bug preventing the creation of a session without serfHealth (https://github.com/hashicorp/consul/issues/3732).
