It's not a good idea to parse HTML text using regular expressions

In w3lib.html regular expressions are used to parse HTML texts:

_ent_re = re.compile(r'&((?P<named>[a-z\d]+)|#(?P<dec>\d+)|#x(?P<hex>[a-f\d]+))(?P<semicolon>;?)', re.IGNORECASE)
_tag_re = re.compile(r'<[a-zA-Z\/!].*?>', re.DOTALL)
_baseurl_re = re.compile(six.u(r'<base\s[^>]*href\s*=\s*[\"\']\s*([^\"\'\s]+)\s*[\"\']'), re.I)
_meta_refresh_re = re.compile(six.u(r'<meta\s[^>]*http-equiv[^>]*refresh[^>]*content\s*=\s*(?P<quote>["\'])(?P<int>(\d*\.)?\d+)\s*;\s*url=\s*(?P<url>.*?)(?P=quote)'), re.DOTALL | re.IGNORECASE)
_cdata_re = re.compile(r'((?P<cdata_s><!\[CDATA\[)(?P<cdata_d>.*?)(?P<cdata_e>\]\]>))', re.DOTALL)

However, this gives incorrect results when the matched markup sits inside an HTML comment, e.g.

>>> from w3lib import html
>>> html.get_base_url("""<!-- <base href="http://example.com/" /> -->""")
'http://example.com/'

Introducing “heavier” utilities like lxml would solve this issue easily, but that might be an awful idea as w3lib aims to be lightweight & fast.
Or maybe we could implement a quick parser whose only job is to strip out the commented parts.

Any ideas?
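
The "quick parser for the commented parts" idea could be sketched roughly as below. This is only an illustration, not w3lib's implementation: `_comment_re`, `strip_comments` and `get_base_url_ignoring_comments` are hypothetical names, the base pattern is the `_baseurl_re` quoted above without the `six.u()` wrapper, and the helper returns the raw `href` value rather than a resolved URL. It also does not try to handle `<!--` appearing inside scripts or attribute values.

```python
import re

# Same pattern as the _baseurl_re quoted in the issue, minus six.u().
_baseurl_re = re.compile(
    r'<base\s[^>]*href\s*=\s*[\"\']\s*([^\"\'\s]+)\s*[\"\']', re.I)

# Hypothetical helper pattern: a complete comment, or an unterminated one
# that runs to the end of the input (e.g. because the page was cut off).
_comment_re = re.compile(r'<!--.*?(?:-->|\Z)', re.DOTALL)


def strip_comments(text):
    """Remove HTML comments before any other regex looks at the text."""
    return _comment_re.sub('', text)


def get_base_url_ignoring_comments(text):
    """Return the raw href of the first <base> tag outside comments, or None."""
    match = _baseurl_re.search(strip_comments(text))
    return match.group(1) if match else None


if __name__ == '__main__':
    # The example from the issue: the only <base> tag is commented out.
    print(get_base_url_ignoring_comments(
        '<!-- <base href="http://example.com/" /> -->'))
```

The extra pass costs one more scan and a copy of the text, which may or may not be acceptable given the speed goal.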

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments:5 (1 by maintainers)

Top GitHub Comments

openandclose commented, Oct 27, 2019

Hello, just for your reference.

I recently tested w3lib’s prescan against the 500 most popular websites. I found three bugs (or behaviors that differ from the HTML5 spec).

books.google.com: <meta http-equiv="content-type"content="text/html; charset=UTF-8"> (no space between attributes)

mega.nz: <meta http-equiv="Content-Type" content="text/html, charset=UTF-8" /> (comma, not semicolon)

stuff.co.nz: doc.write('<body onload=[...] <meta charset="utf-8"/> (matching ‘<body’)

The prescan parsers of validator, jsdom and html5lib-python get the encoding successfully.

…I don’t know whether it is a good idea to fix these and make the prescan regex even more complex.
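
For anyone who wants to reproduce these, the three snippets can be kept as a small regression check. The pattern below is a deliberately loose, hypothetical one (`_loose_charset_re` and `sniff_meta_charset` are made-up names); it is neither w3lib's prescan regex nor the HTML5 prescan algorithm, and it will happily match inside scripts and comments. It only shows that the reported markup variations are easy to pin down in tests.

```python
import re

# Hypothetical, intentionally permissive pattern: any "charset=" inside a
# <meta ...> tag, with optional quotes around the value.
_loose_charset_re = re.compile(
    r'<meta[^>]*?charset\s*=\s*["\']?\s*(?P<charset>[a-z0-9_\-]+)',
    re.IGNORECASE)


def sniff_meta_charset(text):
    """Return the lowercased charset named in the first <meta> match, or None."""
    match = _loose_charset_re.search(text)
    return match.group('charset').lower() if match else None


# The three snippets reported above, copied verbatim (including "[...]").
REPORTED_CASES = [
    '<meta http-equiv="content-type"content="text/html; charset=UTF-8">',
    '<meta http-equiv="Content-Type" content="text/html, charset=UTF-8" />',
    "doc.write('<body onload=[...] <meta charset=\"utf-8\"/>",
]

if __name__ == '__main__':
    for snippet in REPORTED_CASES:
        print(sniff_meta_charset(snippet), snippet)
```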

kmike commented, Aug 12, 2016

@starrify I believe the goal was indeed speed; also, these regexes may see e.g. only the first 4096 bytes of the page, without the rest. Ideas about a proper solution are welcome! It should

a) be almost as fast as these regexes; b) work on arbitrarily truncated HTML files.
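
One way to approach both requirements, sketched under assumptions: keep a single regex pass, but let the pattern consume comments as its first alternative so that a `<base>` tag inside a comment is never reported, and treat a comment cut off by truncation as running to the end of the input. `_scan_re`, `find_base_href` and the `limit` parameter are hypothetical; the 4096 default just mirrors the figure mentioned above, and no timing has been measured here.

```python
import re

# First alternative: a comment, complete or cut off by truncation.
# Second alternative: a <base href=...> tag, capturing the href value.
_scan_re = re.compile(
    r'<!--.*?(?:-->|\Z)'
    r'|<base\s[^>]*href\s*=\s*["\']\s*(?P<url>[^"\'\s]+)\s*["\']',
    re.DOTALL | re.IGNORECASE)


def find_base_href(data, limit=4096):
    """Scan only the first `limit` characters; return the first href found
    outside a comment, or None."""
    head = data[:limit]
    for match in _scan_re.finditer(head):
        url = match.group('url')
        if url is not None:  # a comment match leaves the 'url' group unset
            return url
    return None


if __name__ == '__main__':
    print(find_base_href(
        '<!-- <base href="http://example.com/" /> -->'
        '<base href="http://real.example/">'))
```

Because comments are skipped in the same pass, no intermediate copy of the text is built; whether that is "almost as fast" as the current regexes would need benchmarking.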

