Clean bad HTML
There are some cases of bad HTML that make Scrapy (well, lxml really) choke on the response content, and I was thinking it would make sense to add a CleanBadHtml middleware that could be optionally disabled.
I’ve just stumbled on an example case from a real website, where the response had something like this (real phone number edited) in its content:
text = u'<a href="tel:111\x00111\x001111">111-111-1111</a>'
The \x00 is interpreted as end of input by lxml, so the selector ends up stopping right there:
>>> parsel.Selector(text)
<Selector xpath=None data=u'<html><body><a href="tel:111"></a></body'>
>>> parsel.Selector(text).extract()
u'<html><body><a href="tel:111"></a></body></html>'
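A minimal sketch of the cleaning step being proposed (the helper name and the exact character set are my assumptions; the issue itself only calls for removing \x00):

```python
import re

# Control characters that browsers silently ignore but that make lxml
# treat the rest of the document as truncated. \x00 is the one seen in
# the wild here; the wider range is an assumption for similar cases.
_CTRL_CHARS = re.compile(u"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def clean_bad_html(text):
    """Strip control characters from HTML before handing it to a selector."""
    return _CTRL_CHARS.sub(u"", text)
```

With this applied first, `parsel.Selector(clean_bad_html(text))` would see the full anchor instead of stopping at the first null byte.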
I’m not sure what the best place to fix this is, but I think we ought to do something about it, either in Scrapy or in Parsel, because these HTML pages are accepted by browsers, which ignore the null characters.
For this specific case, simply removing the \x00 characters found in the body before passing it to the selector would avoid the issue.
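For the Scrapy-side option, a sketch of what such a downloader middleware could look like (the class name and the choice to strip only NUL bytes are assumptions; `process_response` and `Response.replace` are the standard Scrapy hooks):

```python
class CleanBadHtmlMiddleware:
    """Hypothetical downloader middleware that strips NUL bytes from
    response bodies so lxml does not stop parsing at the first \x00."""

    def process_response(self, request, response, spider):
        if b"\x00" in response.body:
            # Response objects are immutable; replace() returns a copy
            # with the cleaned body.
            return response.replace(body=response.body.replace(b"\x00", b""))
        return response
```

Being a middleware, it could be disabled per-project via the usual `DOWNLOADER_MIDDLEWARES` setting, which matches the "optionally disabled" requirement above.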
What do you think? Where do you think it would be the best place to do this?
Issue Analytics
- Created: 7 years ago
- Reactions: 1
- Comments: 10 (8 by maintainers)
Top GitHub Comments
I would say this kind of HTML-fixing should be handled by Parsel, but I know of (at least) the https://github.com/alecxe/scrapy-beautifulsoup downloader middleware, which “pipes” responses through BeautifulSoup (with the html5lib parser if requested).
From the README:
@eliasdorneles , have you tested these broken pages with html5lib? If so, there’s also the option of finalizing https://github.com/scrapy/scrapy/pull/1043 in Parsel.
Good catch, thanks @peonone!