[BUG] Rerunning epicli may fail for clustered Postgres


Describe the bug: Rerunning the deployment causes a failure in the postgres role.

Seen in version 0.6.0. I expect the same behaviour in 0.7.0 because the postgres role has not changed.

To Reproduce: Steps to reproduce the behavior:

1. Build a cluster with postgres on two nodes.
2. Observe that vm-1 is set as primary and vm-0 is the hot standby.
3. Rerun the deployment with no changes.
4. The epicli run fails when applying the postgres role.

Expected behavior: There should be no errors.

Config files

OS (please complete the following information):

  • OS: Ubuntu 18.04

Cloud Environment (please complete the following information):

  • Cloud Provider: MS Azure

Additional context

The reason is that the role uses the condition groups['postgresql'][0] == inventory_hostname to decide which host is primary. On the first run the condition resolved to vm-1.

However, on the second run it resolved to vm-0, and because vm-0 was already set up as standby, the task failed.

The logs below show this.
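The failure mode can be illustrated with a minimal Python sketch. This is not epicli code: `pick_primary` is a hypothetical stand-in for the `groups['postgresql'][0]` condition, and the two lists are assumed inventory orders that imitate the same hosts being returned in a different order on each run.

```python
# Hypothetical stand-in for the Ansible condition
# groups['postgresql'][0] == inventory_hostname.
def pick_primary(postgresql_group):
    """Return the host the role would treat as primary: the first in the group."""
    return postgresql_group[0]


# Assumed inventory order on the first run (matches the first log).
first_run = ["de-stdbase-postgresql-vm-1", "de-stdbase-postgresql-vm-0"]
# Assumed inventory order on the rerun: same hosts, different order.
second_run = ["de-stdbase-postgresql-vm-0", "de-stdbase-postgresql-vm-1"]

primary_first = pick_primary(first_run)
primary_second = pick_primary(second_run)

# The set of hosts is identical, yet the selected primary differs.
same_hosts = sorted(first_run) == sorted(second_run)
print(primary_first, primary_second, same_hosts)
```

Nothing in the cluster changed between the two runs; only the order of the group did, which is enough to point the "primary" tasks at the standby.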

First run

https://abb-jenkins.duckdns.org:8080/view/Development/job/DEPLOY-de-cluster/433/console

It picks vm-1 as master.

02:23:07 INFO cli.engine.ansible.AnsibleCommand - TASK [postgresql : Check if master is already registered in repmgr] ************
02:23:07 INFO cli.engine.ansible.AnsibleCommand - skipping: [de-stdbase-postgresql-vm-0]
02:23:07 INFO cli.engine.ansible.AnsibleCommand - ok: [de-stdbase-postgresql-vm-1]

The relevant task of the epicli postgres role, in replication-repmgr-Debian.yml:

# Master:
- name: Check if master is already registered in repmgr
  become_user: postgres
  shell: >-
    set -o pipefail &&
    {{ repmgr_bindir[ansible_os_family] }}/repmgr cluster show -f {{ repmgr_config_dir[ansible_os_family] }}/repmgr.conf | grep -i {{ inventory_hostname }} | grep -v standby
  changed_when: false
  register: is_master_already_registered
  failed_when: is_master_already_registered.rc == 2
  args:
    executable: /bin/bash
  when:
    - groups['postgresql'][0] == inventory_hostname

Now re-run it.

https://abb-jenkins.duckdns.org:8080/view/Development/job/DEPLOY-de-cluster/434/console

06:15:30 INFO cli.engine.ansible.AnsibleCommand - TASK [postgresql : Check if master is already registered in repmgr] ************
06:15:30 INFO cli.engine.ansible.AnsibleCommand - skipping: [de-stdbase-postgresql-vm-1]
06:15:31 INFO cli.engine.ansible.AnsibleCommand - ok: [de-stdbase-postgresql-vm-0]

As you can see, it picks vm-0 now, and then it fails because vm-0 is not the primary; vm-1 is.

06:15:35 INFO cli.engine.ansible.AnsibleCommand - skipping: [de-stdbase-postgresql-vm-1]
06:15:36 INFO cli.engine.ansible.AnsibleCommand - fatal: [de-stdbase-postgresql-vm-0]: FAILED! => {"changed": true, "cmd": "/usr/bin/repmgr primary register -f /etc/postgresql/10/main/repmgr.conf --force --superuser=epi_repmgr_admin", "delta": "0:00:00.044363", "end": "2020-09-25 06:15:36.171462", "msg": "non-zero return code", "rc": 1, "start": "2020-09-25 06:15:36.127099", "stderr": "ERROR: server is in standby mode and cannot be registered as a primary", "stderr_lines": ["ERROR: server is in standby mode and cannot be registered as a primary"], "stdout": "", "stdout_lines": []}

There is a 50% chance that the run succeeds, which happens when groups['postgresql'][0] points to vm-1.

Thus the issue is not 100% reproducible and is easily skipped or ignored.

Suggested fix

We need a stable mechanism for selecting nodes, especially for roles such as postgres that depend on the order of nodes to make a decision. I believe the kafka role will suffer from the same issue when assigning the node_id.

For Azure it may be easy to use the vm-name host pattern (the last part is a number), but that might not be portable across providers such as AWS. I don’t know what the hostname looks like in AWS.

Looking at the code in AnsibleInventoryCreator.py that adds the group, I found it a bit harder to fix from there due to the Python return inside the iterations. So for now I don’t have a good way to deal with this.

I may need to look further into the Terraform template to see what hostname rules it generates, and maybe use consistent hostname pattern matching.
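As a rough sketch of the hostname-pattern idea, the snippet below orders hosts by their trailing number. The helper names `host_index` and `stable_order` are hypothetical, and the code assumes the Azure-style naming seen in this issue, where the name ends in digits; it makes no claim about AWS hostnames.

```python
import re


def host_index(hostname):
    """Return the trailing number of a hostname such as 'prefix-vm-10'.

    Assumes names end in digits (as in this issue's Azure hosts).
    Hostnames without a trailing number sort last.
    """
    match = re.search(r"(\d+)$", hostname)
    return int(match.group(1)) if match else float("inf")


def stable_order(hostnames):
    # Sort by numeric suffix first, then by name to break ties deterministically.
    return sorted(hostnames, key=lambda name: (host_index(name), name))


hosts = [
    "de-stdbase-postgresql-vm-10",
    "de-stdbase-postgresql-vm-1",
    "de-stdbase-postgresql-vm-0",
    "de-stdbase-postgresql-vm-2",
]
ordered = stable_order(hosts)
print(ordered)
```

Comparing the suffix as an integer matters once a group has ten or more nodes: a plain string sort would place vm-10 before vm-2.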

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 13 (12 by maintainers)

Top GitHub Comments

3 reactions
sunshine69 commented, Sep 26, 2020

Let’s try sorting the result; this seems to have resolved the problem for me. Any comments are welcome.

APIProxy.py

    def get_ips_for_feature(self, component_key):
        look_for_public_ip = self.cluster_model.specification.cloud.use_public_ips
        cluster = cluster_tag(self.cluster_prefix, self.cluster_name)
        running_instances = self.run(self, f'az vm list-ip-addresses --ids $(az resource list --query "[?type==\'Microsoft.Compute/virtualMachines\' && tags.{component_key} == \'\' && tags.cluster == \'{cluster}\'].id" --output tsv)')
        result = []
        for instance in running_instances:
            if isinstance(instance, list):
                instance = instance[0]
            name = instance['virtualMachine']['name']
            if look_for_public_ip:
                ip = instance['virtualMachine']['network']['publicIpAddresses'][0]['ipAddress']
            else:
                ip = instance['virtualMachine']['network']['privateIpAddresses'][0]
            result.append(AnsibleHostModel(name, ip))
        result.sort(key=lambda x: x.name, reverse=False)
        return result

2 reactions
sk4zuzu commented, Sep 29, 2020

I’m afraid that sorting hostnames in AWS is not a complete solution. It works only on the assumption that nobody adds new nodes to the cluster or removes old ones. The nodes should rather be sorted by the timestamp of VM creation or something similar, not by hostname 🤔 Referring to this line: https://github.com/epiphany-platform/epiphany/pull/1706/files#diff-20056616cbf0a609d4a1ac1d280b8eeaR26
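A minimal sketch of the creation-time idea, under stated assumptions: the `Host` class and its `created` field are hypothetical and are not part of the AnsibleHostModel used in the snippet above, and how the timestamp would be obtained from the cloud provider is left open.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Host:
    # Hypothetical model; the `created` field is an assumption for illustration.
    name: str
    ip: str
    created: datetime


def order_by_creation(hosts):
    # Oldest VM first; the name only breaks ties between equal timestamps.
    return sorted(hosts, key=lambda h: (h.created, h.name))


cluster = [
    Host("pg-b", "10.0.0.5", datetime(2020, 9, 25, 2, 0)),
    Host("pg-c", "10.0.0.6", datetime(2020, 9, 25, 2, 5)),
]
# A node added later whose name happens to sort before the existing ones.
grown = cluster + [Host("pg-a", "10.0.0.7", datetime(2020, 9, 29, 9, 0))]

first_by_name = sorted(grown, key=lambda h: h.name)[0].name
first_by_creation = order_by_creation(grown)[0].name
print(first_by_name, first_by_creation)
```

In this example, name ordering moves the first position to the newly added node, while creation-time ordering keeps it on the oldest one. Removing the oldest node would still change the first element in either scheme.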
