[BUG] Rerunning epicli may fail for clustered Postgres
Describe the bug
Rerunning the deployment causes a failure in the postgresql role.
Seen in version 0.6.0; the same behaviour is expected in 0.7.0 because the postgresql role has not changed.
To Reproduce
Steps to reproduce the behavior:
1. Build a cluster with Postgres on two nodes.
2. Observe that vm-1 is set as primary and vm-0 as hot standby.
3. Rerun the deployment with no changes.
4. The epicli run fails when applying the postgresql role.
Expected behavior
The rerun should complete without errors.
Config files
OS (please complete the following information):
- OS: Ubuntu 18.04
Cloud Environment (please complete the following information):
- Cloud Provider: MS Azure
Additional context
The reason is that the role uses the condition groups['postgresql'][0] == inventory_hostname
to decide which host is primary. On the first run the condition resolved to vm-1.
On the second run, however, it resolved to vm-0, and because vm-0 was already set up as a standby, the task failed.
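To make the cause concrete, here is a minimal Python sketch of the selection logic. It is not epicli code: the helper `pick_primary` and the two host lists are hypothetical, and the only assumption is that the inventory can list the same hosts in a different order on different runs.

```python
def pick_primary(hosts):
    """Mimic groups['postgresql'][0]: whichever host is listed first wins."""
    return hosts[0]


# Hypothetical inventories: the same two hosts, listed in a different order.
first_run = ["de-stdbase-postgresql-vm-1", "de-stdbase-postgresql-vm-0"]
second_run = ["de-stdbase-postgresql-vm-0", "de-stdbase-postgresql-vm-1"]

primary_first = pick_primary(first_run)
primary_second = pick_primary(second_run)

# The "primary" flips between runs although the cluster itself is unchanged.
selection_is_stable = primary_first == primary_second
```

With these inputs `selection_is_stable` is `False`, which is the situation described above: the second run treats the standby as the primary.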
The logs below show this.
First run
https://abb-jenkins.duckdns.org:8080/view/Development/job/DEPLOY-de-cluster/433/console
It picks vm-1 as master:
02:23:07 INFO cli.engine.ansible.AnsibleCommand - TASK [postgresql : Check if master is already registered in repmgr] ************
02:23:07 INFO cli.engine.ansible.AnsibleCommand - skipping: [de-stdbase-postgresql-vm-0]
02:23:07 INFO cli.engine.ansible.AnsibleCommand - ok: [de-stdbase-postgresql-vm-1]
The corresponding task in the epicli postgresql role, in replication-repmgr-Debian.yml:
# Master:
- name: Check if master is already registered in repmgr
  become_user: postgres
  shell: >-
    set -o pipefail &&
    {{ repmgr_bindir[ansible_os_family] }}/repmgr cluster show -f {{ repmgr_config_dir[ansible_os_family] }}/repmgr.conf | grep -i {{ inventory_hostname }} | grep -v standby
  changed_when: false
  register: is_master_already_registered
  failed_when: is_master_already_registered.rc == 2
  args:
    executable: /bin/bash
  when:
    - groups['postgresql'][0] == inventory_hostname
Now re-run it.
https://abb-jenkins.duckdns.org:8080/view/Development/job/DEPLOY-de-cluster/434/console
06:15:30 INFO cli.engine.ansible.AnsibleCommand - TASK [postgresql : Check if master is already registered in repmgr] ************
06:15:30 INFO cli.engine.ansible.AnsibleCommand - skipping: [de-stdbase-postgresql-vm-1]
06:15:31 INFO cli.engine.ansible.AnsibleCommand - ok: [de-stdbase-postgresql-vm-0]
As you can see, it now picks vm-0, and then it fails because the primary is vm-1, not vm-0:
06:15:35 INFO cli.engine.ansible.AnsibleCommand - skipping: [de-stdbase-postgresql-vm-1]
06:15:36 INFO cli.engine.ansible.AnsibleCommand - fatal: [de-stdbase-postgresql-vm-0]: FAILED! => {"changed": true, "cmd": "/usr/bin/repmgr primary register -f /etc/postgresql/10/main/repmgr.conf --force --superuser=epi_repmgr_admin", "delta": "0:00:00.044363", "end": "2020-09-25 06:15:36.171462", "msg": "non-zero return code", "rc": 1, "start": "2020-09-25 06:15:36.127099", "stderr": "ERROR: server is in standby mode and cannot be registered as a primary", "stderr_lines": ["ERROR: server is in standby mode and cannot be registered as a primary"], "stdout": "", "stdout_lines": []}
There is a 50% chance that the rerun succeeds, namely when groups['postgresql'][0] points to vm-1.
Thus the issue is not 100% reproducible and is easily skipped or ignored.
Suggestion to fix.
We need a stable mechanism for selecting nodes, especially for roles that depend on the order of nodes to make a decision, such as postgres. I believe the kafka role will suffer from the same issue when assigning node_id.
For Azure it may be easy to use the VM name host pattern (the last part is a number), but that might not be portable across providers such as AWS. I don't know what hostnames look like in AWS.
Looking at the code in AnsibleInventoryCreator.py
that adds the group, I found that it is a bit harder to fix from there because of the way the Python code returns values inside the iterations. So for now I don't have a good way to deal with this.
I may need to look further into the Terraform template to see the hostname rules it generates and maybe use consistent hostname pattern matching.
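The hostname-pattern idea could look roughly like the sketch below. It assumes hostnames end in a numeric index, as the Azure names in the logs do; `host_sort_key` and `sort_hosts` are hypothetical helpers, not functions from epicli, and whether AWS hostnames fit this pattern is not established here.

```python
import re

# Split a hostname into a non-numeric prefix and a trailing number, if any.
_TRAILING_NUMBER = re.compile(r"^(.*?)(\d+)$")


def host_sort_key(hostname):
    """Sort by prefix first, then by the trailing index as an integer."""
    match = _TRAILING_NUMBER.match(hostname)
    if match:
        prefix, number = match.groups()
        return (prefix, int(number))
    # No trailing number: sort on the whole name, ahead of numbered siblings.
    return (hostname, -1)


def sort_hosts(hostnames):
    """Return hostnames in an order that does not depend on the input order."""
    return sorted(hostnames, key=host_sort_key)


hosts = [
    "de-stdbase-postgresql-vm-10",
    "de-stdbase-postgresql-vm-1",
    "de-stdbase-postgresql-vm-0",
]
ordered = sort_hosts(hosts)
```

Comparing the index as an integer keeps vm-10 after vm-1, which a plain string sort would also do here but not for vm-2 versus vm-10. Note that a sorted order makes the choice repeatable, but on a cluster that was already deployed with a different primary, the first sorted host may still not be the actual primary.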
Top GitHub Comments
Let's try sorting it and see whether that resolves the problem. Any comments are welcome.
APIProxy.py
I’m afraid that sorting hostnames in AWS is not a complete solution. It will work only on the assumption that nobody adds new nodes to the cluster or removes old ones. The hosts should rather be sorted by the timestamp of when each VM was created, or something similar, not by hostname 🤔 Referring to this line: https://github.com/epiphany-platform/epiphany/pull/1706/files#diff-20056616cbf0a609d4a1ac1d280b8eeaR26
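A small sketch of the difference between the two orderings follows. Everything in it is hypothetical: the `Host` record, its `created_at` field, and the node names are illustrative only, and how a creation timestamp would actually be obtained from the cloud provider is not shown.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Host:
    name: str
    created_at: datetime  # hypothetical field: when the VM was created


def order_by_creation(hosts):
    """Oldest VM first; the name is only a tie-breaker for equal timestamps."""
    return sorted(hosts, key=lambda h: (h.created_at, h.name))


original = [
    Host("node-b", datetime(2020, 9, 1)),
    Host("node-c", datetime(2020, 9, 2)),
]
# A node added later whose name happens to sort first alphabetically.
scaled = original + [Host("node-a", datetime(2020, 9, 25))]

first_by_name = sorted(scaled, key=lambda h: h.name)[0].name
first_by_creation = order_by_creation(scaled)[0].name
```

After scaling, sorting by name moves the new `node-a` to the front, while sorting by creation time keeps `node-b` first. Removing the oldest node would still change the first element, so this narrows the problem and does not eliminate it.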