Duplicate meta entries --> fail
I’m having trouble parsing attributes for this page:
https://cosmonaut.blog/2019/02/20/no-bernie/
This might very well be down to my non-existent JS/CSS skills, so feel free to close, and sorry for the disturbance.
The problem I have is with the `lead_image_url` selectors. The “default” (for most extractors) for this one would be `[['meta[property="og:image"]', 'content']]` or `[['meta[name="twitter:image"]','value']]`, but both of those, when executed, return two near-identical entries, causing the whole thing to fall apart (because if I read the tutorial correctly, they’d need to return exactly one item).
The other idea would be to query the image directly from the page, using `[['img.wp-post-image', 'src']]`, but this is an image with `srcset`, so the result ends up being a concatenation of multiple URLs (each of which would be acceptable to me) which I cannot further process in the simple `selector: [...]` setting.
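To illustrate the `srcset` problem: assuming the concatenated value has the usual `srcset` shape (comma-separated candidates, each a URL followed by a width or density descriptor), picking out a single URL takes a small post-processing step along these lines. `srcsetCandidates` is a hypothetical helper, not part of Mercury Parser, and the example URLs are made up.

```javascript
// Hypothetical helper (not a Mercury Parser API): split a srcset-style
// string into its candidate URLs.
function srcsetCandidates(srcset) {
  return srcset
    .split(/,\s+/) // assumes candidates are separated by a comma plus whitespace
    .map(part => part.trim().split(/\s+/)[0]) // drop the "300w" / "2x" descriptor
    .filter(url => url.length > 0);
}

// Made-up example value in the usual srcset shape.
const example =
  'https://example.com/a-300x200.jpg 300w, https://example.com/a-768x512.jpg 768w';
const urls = srcsetCandidates(example);
const firstUrl = urls[0];
```

Splitting on a comma followed by whitespace is a simplification: it relies on the URLs themselves not containing that sequence.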
Am I missing something here?
- Platform: `Linux my-desktop 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux`
- Mercury Parser Version: master (`2a3ade706dc445ecb09cce552b087c850d2cb817`)
sorry, forgot to close this. thanks again!
Indeed, a selector must return only one match, and there are a couple of ways to handle this:
- your idea of querying the image directly from the page is perfectly correct, and the `srcset` issue that you mentioned has been addressed in #312, which has been merged into master and should be included in the next package release;
- alternatively, and in other situations where a unique selector doesn’t exist, you can use a selector that accounts for the two matches by having it return the second match, while adding a fallback selector to match the first element in case the website’s HTML is changed to no longer have duplicate tags; so it could be something like:
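A minimal sketch of that second approach, assuming selectors are tried in order and that one matching more than one element is skipped. The extractor name and the sibling-combinator selector are illustrative assumptions, not code taken from this thread; the `meta[name="twitter:image"]` selector and `value` attribute are reused from the question.

```javascript
// Hypothetical custom extractor object; the name is illustrative.
const CosmonautBlogExtractor = {
  domain: 'cosmonaut.blog',
  lead_image_url: {
    selectors: [
      // The general sibling combinator (~) matches a twitter:image tag only
      // when an earlier sibling twitter:image tag exists, so with exactly two
      // duplicates it matches just the second one.
      ['meta[name="twitter:image"] ~ meta[name="twitter:image"]', 'value'],
      // Fallback: matches the single tag if the duplicate is later removed.
      ['meta[name="twitter:image"]', 'value'],
    ],
  },
};
```

With three or more duplicates the first selector would again match several elements, so this only covers the two-tag case described above. The same pattern should apply to the `og:image` selector.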