shrink-osd fails due to missing FSIDs

Bug Report

What happened: I attempted to shrink a number of OSDs using the shrink-osd.yml playbook. However, the set_fact osd_hosts task fails because it tries to read the key osd_fsid, which does not exist in the output of ceph osd find. My cluster was created with the lvm strategy, and my OSDs have FSIDs, as confirmed by running ceph-volume lvm list on the host, yet the output of ceph osd find does not include an FSID. As a result, the playbook fails with a templating error:

TASK [set_fact osd_hosts] *******************************************************************************************************
Friday 22 February 2019  23:36:40 +0000 (0:00:00.074)       0:00:06.379 *******
ok: [localhost] => (item={'_ansible_parsed': True, 'stderr_lines': [], '_ansible_item_result': True, u'end': u'2019-02-22 23:36:40.459355', '_ansible_no_log': False, '_ansible_delegated_vars': {'ansible_delegated_host': u'allmight.fc.kj', 'ansible_host': u'allmight.fc.kj'}, u'cmd': [u'ceph', u'--cluster', u'ceph', u'osd', u'find', u'0'], u'rc': 0, u'stdout': u'{\n    "osd": 0,\n    "ip": "10.1.15.21:6803/1951",\n    "crush_location": {\n        "datacenter": "fc",\n        "host": "allmight",\n        "room": "office",\n        "root": "default"\n    }\n}', 'item': u'0', u'delta': u'0:00:00.314916', '_ansible_item_label': u'0', u'stderr': u'', u'changed': True, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': False, u'_raw_params': u' ceph --cluster ceph osd find 0', u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin': None}}, 'stdout_lines': [u'{', u'    "osd": 0,', u'    "ip": "10.1.15.21:6803/1951",', u'    "crush_location": {', u'        "datacenter": "fc",', u'        "host": "allmight",', u'        "room": "office",', u'        "root": "default"', u'    }', u'}'], u'start': u'2019-02-22 23:36:40.144439', '_ansible_ignore_errors': None, 'failed': False})
fatal: [localhost]: FAILED! => {"msg": "Unexpected templating type error occurred on ({{ osd_hosts | default([]) + [ (item.stdout | from_json).crush_location.host, (item.stdout | from_json).osd_fsid ] }}): coercing to Unicode: need string or buffer, list found"}

Output of ceph osd find when run on the host (the osd_fsid key is missing):

ceph --cluster ceph osd find 0
{
    "osd": 0,
    "ip": "10.1.15.21:6803/1951",
    "crush_location": {
        "datacenter": "fc",
        "host": "allmight",
        "room": "office",
        "root": "default"
    }
}
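
The failure mode can be reproduced outside Ansible. The following is a minimal Python sketch, not part of ceph-ansible: it parses the output shown above and reads osd_fsid defensively instead of assuming the key exists. The helper name host_and_fsid is hypothetical and used only for illustration.

```python
import json

# Output of `ceph --cluster ceph osd find 0`, copied from the report above.
OSD_FIND_OUTPUT = """
{
    "osd": 0,
    "ip": "10.1.15.21:6803/1951",
    "crush_location": {
        "datacenter": "fc",
        "host": "allmight",
        "room": "office",
        "root": "default"
    }
}
"""


def host_and_fsid(osd_find_stdout):
    """Hypothetical helper: return (host, osd_fsid) from `ceph osd find` JSON.

    osd_fsid is None when the monitor does not report the key.
    """
    data = json.loads(osd_find_stdout)
    host = data["crush_location"]["host"]
    fsid = data.get("osd_fsid")  # .get() avoids failing on the missing key
    return host, fsid


host, fsid = host_and_fsid(OSD_FIND_OUTPUT)
if fsid is None:
    print(f"OSD on host {host!r}: ceph osd find returned no osd_fsid")
```

With the output from this report, the helper returns the host name and None for the FSID, which is the situation the playbook template does not account for.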

What you expected to happen: The playbook to execute and remove the OSDs specified.

How to reproduce it (minimal and precise):

  1. Check out the v3.2.7 tag.
  2. Copy the shrink-osd.yml playbook to the main directory.
  3. Execute the playbook against a Ceph Mimic cluster. The playbook fails because the osd_fsid key is not present in the output of ceph osd find, which is run by a previous task.

This issue appears to have been introduced by the backport of #3515 to v3.2 via #3530. Using the shrink-osd.yml playbook from the v3.2.5 tag appears to work without issue: the playbook executes without any errors and lsblk lists no LVM partitions on the disks previously used for the Ceph OSDs.

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 18.04.2 LTS
  • Kernel (e.g. uname -a): 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Docker version if applicable (e.g. docker version): N/A
  • Ansible version (e.g. ansible-playbook --version): 2.6.11
  • ceph-ansible version (e.g. git head or tag or stable branch): tag 3.2.7
  • Ceph version (e.g. ceph -v): ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
dotnwat commented, Mar 4, 2019

I’m seeing that the releases are being cut now, so they should be arriving very soon

1 reaction
dotnwat commented, Mar 4, 2019

@dsavineau @leseb the change on the ceph side was merged into mimic and luminous a bit after the 13.2.4 release was cut. I’ll find out when a point release is going to be cut which should solve this issue after a monitor upgrade. Otherwise, maybe we can handle this out of band from the official playbook?
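
As an illustration only (not a fix proposed in the thread), one out-of-band approach would be to take the FSID from ceph-volume lvm list on the OSD host, which the report says does show the FSIDs. The sketch below assumes the JSON form of that listing is a mapping from OSD id to a list of devices, each carrying a tags dictionary with a ceph.osd_fsid entry. That shape, the helper name fsid_from_lvm_list, and the sample data are assumptions, not taken from the issue.

```python
import json


def fsid_from_lvm_list(lvm_list_stdout, osd_id):
    """Hypothetical helper: look up an OSD's FSID in ceph-volume JSON output.

    Assumes a mapping of OSD id -> list of devices, each with a "tags"
    dictionary that may contain "ceph.osd_fsid". Returns None if not found.
    """
    data = json.loads(lvm_list_stdout)
    for device in data.get(str(osd_id), []):
        fsid = device.get("tags", {}).get("ceph.osd_fsid")
        if fsid:
            return fsid
    return None


# Made-up sample in the assumed shape; the FSID value is a placeholder.
SAMPLE_LVM_LIST = json.dumps({
    "0": [
        {
            "type": "block",
            "tags": {
                "ceph.osd_id": "0",
                "ceph.osd_fsid": "00000000-0000-0000-0000-000000000000",
            },
        }
    ]
})

print(fsid_from_lvm_list(SAMPLE_LVM_LIST, 0))
```

A lookup like this would only be a stopgap until the monitors are upgraded to a release in which ceph osd find reports osd_fsid.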
