Duplicate meta entries --> fail

I’m having trouble parsing attributes for this page:

https://cosmonaut.blog/2019/02/20/no-bernie/

This might very much be my non-existent JS/CSS skills, so feel free to close and sorry for the disturbance. The problem I have is with the lead_image_url selectors. The “default” (for most extractors) for this one would be [['meta[property="og:image"]', 'content']] or [['meta[name="twitter:image"]','value']], but both of those, when executed, return two near-identical entries, causing the whole thing to fall apart (because if I read the tutorial correctly, they’d need to return exactly one item).

The other idea would be to query the image directly from the page, using [['img.wp-post-image', 'src']], but this is an image with srcset and so the result ends up being a concatenation with multiple URLs (each of which would be acceptable to me) which I cannot further process in the simple selector: [...] setting.

Am I missing something here?

Platform: Linux my-desktop 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Mercury Parser Version: master (2a3ade706dc445ecb09cce552b087c850d2cb817)

black-puppydog commented, Mar 18, 2019

sorry, forgot to close this. thanks again!

toufic-m commented, Mar 18, 2019

Indeed, a selector must return only one match, and there are a couple of ways to handle this:

- Your idea of querying the image directly from the page is perfectly correct, and the srcset issue you mentioned has been addressed in #312, which has been merged into master and should be included in the next package release.
- Alternatively, and in other situations where a unique selector doesn’t exist, you can use a selector that accounts for the two matches by returning the second match, and add a fallback selector that matches the first element in case the website’s HTML is changed to no longer have duplicate tags. It could be something like:

  lead_image_url: {
    selectors: [
      // Select the `meta[name="og:image"]` that is a subsequent sibling of
      // another `meta[name="og:image"]`, i.e. the second of the two duplicates.
      ['meta[name="og:image"] ~ meta[name="og:image"]', 'value'],
      // If the first selector no longer matches, the tag is no longer
      // duplicated and we can safely select the single remaining one.
      ['meta[name="og:image"]', 'value'],
    ],
  },
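
To make the fallback behaviour concrete, here is a rough sketch of the rule described above: selectors are tried in order, and the first one that matches exactly one element wins. This is a simplified model for illustration, not Mercury Parser's actual code. The `pickUnique` helper, the stubbed `query` lookups and the `example.com` URLs are all made up for the example.

```javascript
// Hypothetical helper (not part of Mercury Parser): return the attribute
// from the first selector that matches exactly one element, else null.
function pickUnique(selectors, query) {
  for (const [selector, attr] of selectors) {
    const matches = query(selector);
    if (matches.length === 1 && matches[0][attr]) {
      return matches[0][attr];
    }
  }
  return null;
}

const SIBLING = 'meta[name="og:image"] ~ meta[name="og:image"]';
const PLAIN = 'meta[name="og:image"]';

const selectors = [
  [SIBLING, 'value'],
  [PLAIN, 'value'],
];

// Stubbed query results standing in for a real DOM lookup.
// Page with two og:image tags: the sibling selector matches only the second.
const withDuplicates = {
  [SIBLING]: [{ value: 'https://example.com/b.jpg' }],
  [PLAIN]: [
    { value: 'https://example.com/a.jpg' },
    { value: 'https://example.com/b.jpg' },
  ],
};

// Page after the duplicate is removed: the sibling selector matches nothing.
const withoutDuplicates = {
  [SIBLING]: [],
  [PLAIN]: [{ value: 'https://example.com/a.jpg' }],
};

const fromDuplicated = pickUnique(selectors, (s) => withDuplicates[s] || []);
const fromSingle = pickUnique(selectors, (s) => withoutDuplicates[s] || []);

console.log(fromDuplicated); // https://example.com/b.jpg
console.log(fromSingle); // https://example.com/a.jpg
```

Note that the sibling selector only yields a single match when the tag appears exactly twice. With three or more duplicates it would match several elements again, and a different selector would be needed.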