[BUG] Rerunning epicli may fail for clustered Postgres

Describe the bug: Re-running the deployment causes a failure in the postgres role.

Seen in version 0.6.0; the same behaviour is expected in 0.7.0 because the postgres role has not changed.

To Reproduce Steps to reproduce the behavior:

1. Build a cluster with postgres on two nodes.
2. Observe that vm-1 is set as primary and vm-0 as hot standby.
3. Re-run the deployment with no changes.
4. The epicli run fails when applying the postgres role.

Expected behavior: There should be no errors.

Config files

OS (please complete the following information):

OS: Ubuntu 18.04

Cloud Environment (please complete the following information):

Cloud Provider: MS Azure

Additional context

The reason is that the role uses the condition groups['postgresql'][0] == inventory_hostname to decide which host is the primary. On the first run the condition resolved to vm-1.

On the second run, however, it resolved to vm-0, and because vm-0 is already set up as a standby, the task failed.
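The effect of that condition can be sketched in Python. This is only an illustration, not epicli code: `pick_primary` is a hypothetical helper that mirrors `groups['postgresql'][0]`, and the two lists are assumed stand-ins for the group order seen in the two runs.

```python
def pick_primary(group_hosts):
    """Mirror groups['postgresql'][0]: the first host in the group wins."""
    return group_hosts[0]


# Assumed group orders for the two runs; host names are taken from the logs.
first_run = ["de-stdbase-postgresql-vm-1", "de-stdbase-postgresql-vm-0"]
second_run = ["de-stdbase-postgresql-vm-0", "de-stdbase-postgresql-vm-1"]

primary_first = pick_primary(first_run)
primary_second = pick_primary(second_run)

# Same hosts, different order, different primary.
print(primary_first, primary_second, primary_first == primary_second)
```

Any ordering that is stable between runs, for example sorting the list before indexing it, makes both calls return the same host.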

The logs below show the case.

First run

https://abb-jenkins.duckdns.org:8080/view/Development/job/DEPLOY-de-cluster/433/console

It picks vm-1 as the master.

02:23:07 INFO cli.engine.ansible.AnsibleCommand - TASK [postgresql : Check if master is already registered in repmgr] ************
02:23:07 INFO cli.engine.ansible.AnsibleCommand - skipping: [de-stdbase-postgresql-vm-0]
02:23:07 INFO cli.engine.ansible.AnsibleCommand - ok: [de-stdbase-postgresql-vm-1]

The task in the epicli postgres role, from replication-repmgr-Debian.yml:

# Master:
- name: Check if master is already registered in repmgr
  become_user: postgres
  shell: >-
    set -o pipefail &&
    {{ repmgr_bindir[ansible_os_family] }}/repmgr cluster show -f {{ repmgr_config_dir[ansible_os_family] }}/repmgr.conf | grep -i {{ inventory_hostname }} | grep -v standby
  changed_when: false
  register: is_master_already_registered
  failed_when: is_master_already_registered.rc == 2
  args:
    executable: /bin/bash
  when:
    - groups['postgresql'][0] == inventory_hostname

Now re-run it.

https://abb-jenkins.duckdns.org:8080/view/Development/job/DEPLOY-de-cluster/434/console

06:15:30 INFO cli.engine.ansible.AnsibleCommand - TASK [postgresql : Check if master is already registered in repmgr] ************
06:15:30 INFO cli.engine.ansible.AnsibleCommand - skipping: [de-stdbase-postgresql-vm-1]
06:15:31 INFO cli.engine.ansible.AnsibleCommand - ok: [de-stdbase-postgresql-vm-0]

As you can see, it picks vm-0 this time, and then it fails because vm-0 is not the primary; vm-1 is.

06:15:35 INFO cli.engine.ansible.AnsibleCommand - skipping: [de-stdbase-postgresql-vm-1]
06:15:36 INFO cli.engine.ansible.AnsibleCommand - fatal: [de-stdbase-postgresql-vm-0]: FAILED! => {"changed": true, "cmd": "/usr/bin/repmgr primary register -f /etc/postgresql/10/main/repmgr.conf --force --superuser=epi_repmgr_admin", "delta": "0:00:00.044363", "end": "2020-09-25 06:15:36.171462", "msg": "non-zero return code", "rc": 1, "start": "2020-09-25 06:15:36.127099", "stderr": "ERROR: server is in standby mode and cannot be registered as a primary", "stderr_lines": ["ERROR: server is in standby mode and cannot be registered as a primary"], "stdout": "", "stdout_lines": []}

There is a 50% chance that the run succeeds, namely when groups['postgresql'][0] points to vm-1.

Thus the issue is not 100% reproducible and is easily skipped or ignored.

Suggested fix

We need a stable mechanism for selecting nodes, especially for roles such as postgres that depend on the order of nodes to make a decision. I believe the kafka role will suffer from the same issue when it assigns the node_id.

For Azure it may be easy to use the VM-name host pattern (the last part is a number), but that might not be portable across providers such as AWS. I don't know what the hostnames look like in AWS.
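A minimal sketch of ordering by that trailing number, assuming names end in a numeric index as the Azure names in the logs do. The helper names and sample hosts are hypothetical, and the sketch says nothing about providers whose names do not follow the pattern; such hosts are simply placed last, ordered by name.

```python
import re


def trailing_index(hostname):
    """Return the numeric suffix of a name such as 'pg-vm-10', or None."""
    match = re.search(r"(\d+)$", hostname)
    return int(match.group(1)) if match else None


def sort_hosts(hostnames):
    """Order hosts by numeric suffix; hosts without one go last, by name."""
    def key(name):
        index = trailing_index(name)
        return (index is None, index if index is not None else 0, name)
    return sorted(hostnames, key=key)


hosts = ["pg-vm-10", "pg-vm-2", "pg-vm-0"]
print(sorted(hosts))      # plain string sort
print(sort_hosts(hosts))  # numeric-suffix sort
```

The numeric key matters once a cluster has ten or more nodes: a plain string sort places `pg-vm-10` before `pg-vm-2`.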

Looking at the code in AnsibleInventoryCreator.py that adds the group, I found that it is a bit harder to fix there because of the way the Python code returns inside the iteration. So for now I don't have a good way to deal with this.

I may need to look further into the Terraform template to see which hostname rules it generates, and perhaps match on a consistent hostname pattern.

Issue Analytics

State:
Created 3 years ago
Comments:13 (12 by maintainers)

Top GitHub Comments

3 reactions

sunshine69 commented, Sep 26, 2020

Let's try sorting the hosts and see whether it resolves the problem for me. Any comments are welcome.

APIProxy.py

    def get_ips_for_feature(self, component_key):
        look_for_public_ip = self.cluster_model.specification.cloud.use_public_ips
        cluster = cluster_tag(self.cluster_prefix, self.cluster_name)
        running_instances = self.run(self, f'az vm list-ip-addresses --ids $(az resource list --query "[?type==\'Microsoft.Compute/virtualMachines\' && tags.{component_key} == \'\' && tags.cluster == \'{cluster}\'].id" --output tsv)')
        result = []
        for instance in running_instances:
            if isinstance(instance, list):
                instance = instance[0]
            name = instance['virtualMachine']['name']
            if look_for_public_ip:
                ip = instance['virtualMachine']['network']['publicIpAddresses'][0]['ipAddress']
            else:
                ip = instance['virtualMachine']['network']['privateIpAddresses'][0]
            result.append(AnsibleHostModel(name, ip))
        result.sort(key=lambda x: x.name, reverse=False)
        return result

2 reactions

sk4zuzu commented, Sep 29, 2020

I'm afraid that sorting hostnames in AWS is not a complete solution. It works only on the assumption that nobody adds new nodes to the cluster or removes old ones. The hosts should rather be sorted by the timestamp of when each VM was created, or something similar, but not by hostname 🤔 Referring to this line: https://github.com/epiphany-platform/epiphany/pull/1706/files#diff-20056616cbf0a609d4a1ac1d280b8eeaR26
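A rough sketch of the creation-time ordering suggested here. The `Host` class and its `created_at` field are hypothetical: the thread does not say where the timestamp would come from, and the names, IPs and dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Host:
    name: str
    ip: str
    created_at: datetime  # hypothetical field; its source is not given in this thread


def order_by_creation(hosts):
    """Oldest VM first; the name is a tie-breaker so the order is deterministic."""
    return sorted(hosts, key=lambda h: (h.created_at, h.name))


hosts = [
    Host("pg-b", "10.0.0.5", datetime(2020, 9, 1, tzinfo=timezone.utc)),
    Host("pg-a", "10.0.0.9", datetime(2020, 9, 20, tzinfo=timezone.utc)),
]

# 'pg-a' was added later: it sorts first by name but second by creation time.
print([h.name for h in order_by_creation(hosts)])
```

With this ordering, a node added later cannot displace the first entry. The first entry still changes if the oldest node itself is removed.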