Clean bad HTML
There are some cases of bad HTML that make Scrapy (well, lxml really) choke on the response content, and I was thinking it would make sense to add a CleanBadHtml middleware that could be optionally disabled.
I’ve just stumbled on an example case from a real website, where the response had something like this (real phone number edited) in its content:
text = u'<a href="tel:111\x00111\x001111">111-111-1111</a>'
The \x00 is interpreted as end of input by lxml, so the selector ends up stopping right there:
>>> parsel.Selector(text)
<Selector xpath=None data=u'<html><body><a href="tel:111"></a></body'>
>>> parsel.Selector(text).extract()
u'<html><body><a href="tel:111"></a></body></html>'
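A minimal sketch of the cleaning step being proposed (the helper name and the exact character set are my assumptions; the issue itself only calls for removing \x00):

```python
import re

# Control characters that browsers silently ignore but that make lxml
# treat the rest of the document as truncated. \x00 is the one seen in
# the wild here; the wider range is an assumption for similar cases.
_CTRL_CHARS = re.compile(u"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def clean_bad_html(text):
    """Strip control characters from HTML before handing it to a selector."""
    return _CTRL_CHARS.sub(u"", text)
```

With this applied first, `parsel.Selector(clean_bad_html(text))` would see the full anchor instead of stopping at the first null byte.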
I’m not sure what the best place to fix this is, but I think we ought to do something about it, either in Scrapy or in Parsel, because these HTML pages are accepted by browsers, which ignore the null characters.
For this specific case, simply removing the \x00 characters found in the body before passing it to the selector would avoid the issue.
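For the Scrapy-side option, a sketch of what such a downloader middleware could look like (the class name and the choice to strip only NUL bytes are assumptions; `process_response` and `Response.replace` are the standard Scrapy hooks):

```python
class CleanBadHtmlMiddleware:
    """Hypothetical downloader middleware that strips NUL bytes from
    response bodies so lxml does not stop parsing at the first \x00."""

    def process_response(self, request, response, spider):
        if b"\x00" in response.body:
            # Response objects are immutable; replace() returns a copy
            # with the cleaned body.
            return response.replace(body=response.body.replace(b"\x00", b""))
        return response
```

Being a middleware, it could be disabled per-project via the usual `DOWNLOADER_MIDDLEWARES` setting, which matches the "optionally disabled" requirement above.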
What do you think? Where do you think it would be the best place to do this?
Issue Analytics
- Created: 7 years ago
- Reactions: 1
- Comments: 10 (8 by maintainers)
Top GitHub Comments
I would say this kind of HTML-fixing should be handled by Parsel, but I know of (at least) the https://github.com/alecxe/scrapy-beautifulsoup downloader middleware, which “pipes” responses through BeautifulSoup (with the html5lib parser if requested).
From the README:
@eliasdorneles , have you tested these broken pages with html5lib? If so, there’s also the option of finalizing https://github.com/scrapy/scrapy/pull/1043 in Parsel.
Good catch, thanks @peonone!