
There are some cases of bad HTML that make Scrapy (well, lxml really) choke on the response content, and I was thinking it would make sense to add a CleanBadHtml middleware that could be optionally disabled.

I’ve just stumbled on an example case from a real website, where the response had something like this (real phone number edited) in its content:

text = u'<a href="tel:111\x00111\x001111">111-111-1111</a>'

The \x00 is interpreted as end of input by lxml, so the selector ends up stopping right there:

>>> parsel.Selector(text)
<Selector xpath=None data=u'<html><body><a href="tel:111"></a></body'>
>>> parsel.Selector(text).extract()
u'<html><body><a href="tel:111"></a></body></html>'

I’m not sure what the best place to fix this is, but I think we have to do something about it, either in Scrapy or in Parsel, because these HTML pages are accepted by browsers, which ignore the null characters.

For this specific case, simply removing the \x00 characters found in the body before passing to the selector would avoid the issue.
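A minimal sketch of that workaround (the `strip_nul` helper name is my own, not anything from Scrapy or Parsel): strip the NUL characters from the text before handing it to the selector.

```python
def strip_nul(text):
    """Remove NUL characters, which make lxml stop parsing early."""
    return text.replace(u"\x00", u"")

text = u'<a href="tel:111\x00111\x001111">111-111-1111</a>'
cleaned = strip_nul(text)

# parsel.Selector(cleaned) should now see the whole document instead of
# truncating at the first \x00, since the offending bytes are gone.
print(cleaned)  # <a href="tel:1111111111">111-111-1111</a>
```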

What do you think? Where would be the best place to do this?
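For the middleware route, here is a rough sketch of what a CleanBadHtml downloader middleware could look like. The class name comes from the proposal above; the `FakeResponse` stub is a stand-in for `scrapy.http.Response` (whose real `replace()` method returns a copy with the new body) so the example is self-contained.

```python
# Hypothetical "CleanBadHtml" downloader middleware sketch; the name and
# behavior are taken from the proposal in this issue, not from Scrapy itself.
class CleanBadHtmlMiddleware:
    """Strip NUL bytes from response bodies before they reach the parser."""

    def process_response(self, request, response, spider):
        # Only rebuild the response when it actually contains NUL bytes.
        if b"\x00" in response.body:
            return response.replace(body=response.body.replace(b"\x00", b""))
        return response


# Minimal stand-in for scrapy.http.Response, just for demonstration.
class FakeResponse:
    def __init__(self, body):
        self.body = body

    def replace(self, body):
        return FakeResponse(body)


mw = CleanBadHtmlMiddleware()
dirty = FakeResponse(b'<a href="tel:111\x00111\x001111">111-111-1111</a>')
fixed = mw.process_response(None, dirty, None)
print(fixed.body)  # b'<a href="tel:1111111111">111-111-1111</a>'
```

Making it optional would just be a matter of registering it in DOWNLOADER_MIDDLEWARES and guarding it with a setting.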

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 10 (8 by maintainers)

Top GitHub Comments

1 reaction
redapple commented, Jan 5, 2017

I would say this kind of HTML-fixing should be handled by Parsel, but I know of (at least) the https://github.com/alecxe/scrapy-beautifulsoup downloader middleware, which “pipes” responses through BeautifulSoup (with the html5lib parser if requested).

From the README:

BeautifulSoup itself with the help of an underlying parser of choice does a pretty good job of handling non-well-formed or broken HTML. In some cases, it makes sense to pipe the HTML through BeautifulSoup to “fix” it.

@eliasdorneles , have you tested these broken pages with html5lib? If so, there’s also the option of finalizing https://github.com/scrapy/scrapy/pull/1043 in Parsel.

0 reactions
kmike commented, Oct 23, 2020

Good catch, thanks @peonone!
