It's not a good idea to parse HTML text using regular expressions
In w3lib.html, regular expressions are used to parse HTML text:
```python
_ent_re = re.compile(r'&((?P<named>[a-z\d]+)|#(?P<dec>\d+)|#x(?P<hex>[a-f\d]+))(?P<semicolon>;?)', re.IGNORECASE)
_tag_re = re.compile(r'<[a-zA-Z\/!].*?>', re.DOTALL)
_baseurl_re = re.compile(six.u(r'<base\s[^>]*href\s*=\s*[\"\']\s*([^\"\'\s]+)\s*[\"\']'), re.I)
_meta_refresh_re = re.compile(six.u(r'<meta\s[^>]*http-equiv[^>]*refresh[^>]*content\s*=\s*(?P<quote>["\'])(?P<int>(\d*\.)?\d+)\s*;\s*url=\s*(?P<url>.*?)(?P=quote)'), re.DOTALL | re.IGNORECASE)
_cdata_re = re.compile(r'((?P<cdata_s><!\[CDATA\[)(?P<cdata_d>.*?)(?P<cdata_e>\]\]>))', re.DOTALL)
```
However, this gives incorrect results when commented-out content is involved, e.g.:

```python
>>> from w3lib import html
>>> html.get_base_url("""<!-- <base href="http://example.com/" /> -->""")
'http://example.com/'
```
Introducing “heavier” utilities like lxml
would solve this issue easily, but that might be a bad idea, as w3lib
aims to be lightweight and fast.
Alternatively, we could implement a small parser merely for eliminating the commented-out parts.
Any ideas?
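As a rough sketch of that last idea (hypothetical helper, not w3lib's actual code): strip comments first, then apply the existing regex. With that pre-pass, the example above no longer matches the commented-out tag.

```python
import re

# Hypothetical sketch, not part of w3lib: drop HTML comments before
# applying the base-url regex, so commented-out <base> tags are ignored.
_comment_re = re.compile(r'<!--.*?-->', re.DOTALL)
_baseurl_re = re.compile(
    r'<base\s[^>]*href\s*=\s*["\']\s*([^"\'\s]+)\s*["\']', re.IGNORECASE)

def get_base_url(text, default=''):
    text = _comment_re.sub('', text)  # eliminate commented-out parts
    m = _baseurl_re.search(text)
    return m.group(1) if m else default

assert get_base_url('<!-- <base href="http://example.com/" /> -->') == ''
assert get_base_url('<base href="http://example.com/" />') == 'http://example.com/'
```

This keeps the regex-based approach (and its speed) and only adds one extra substitution pass over the input.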
Issue Analytics
- State:
- Created: 7 years ago
- Comments: 5 (1 by maintainers)
Top Results From Across the Web

- Using regular expressions to parse HTML: why not?
  This answer draws the right conclusion ("It's a bad idea to parse HTML with Regex") from wrong arguments ("Because HTML isn't a...
- Can regular expressions parse HTML or not?
  So according to computer science theory, can regular expressions parse HTML? Not by the original meaning of regular expression, but yes, PCRE ...
- When Not to Use Regular Expressions - Atomic Spin
  Regex isn't suited to parse HTML because HTML isn't a regular language. Regex probably won't be the tool to reach for when parsing...
- You can't parse HTML with Regular Expressions. - Reddit
  It's still a bad idea to write complex parsers in regexes, because it's quite difficult and error-prone and regex syntax quickly gets unhelpful...
- Parsing Html The Cthulhu Way - Coding Horror
  HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hello, just for your reference.

I recently tested w3lib’s prescan against the 500 most popular websites. I found three bugs (or behaviors that differ from the HTML5 spec):

books.google.com:
`<meta http-equiv="content-type"content="text/html; charset=UTF-8">`
(no space between attributes)

mega.nz:
`<meta http-equiv="Content-Type" content="text/html, charset=UTF-8" />`
(comma, not semicolon)

stuff.co.nz:
`doc.write('<body onload=[...] <meta charset="utf-8"/>`
(matching ‘<body’)

The validator’s, jsdom’s, and html5lib-python’s prescan parsers all get the encoding successfully.
…I don’t know whether it is a good idea to fix these and make the prescan regex even more complex.
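For illustration only (this is neither w3lib's code nor the HTML5 prescan algorithm), a deliberately lenient charset matcher would accept all three real-world variants above:

```python
import re

# Illustrative sketch: a lenient charset extractor tolerating the three
# deviations observed above (missing space between attributes, comma
# instead of semicolon, and a meta tag inside a doc.write string).
_charset_re = re.compile(rb'''<meta[^>]+charset\s*=\s*["']?([\w-]+)''',
                         re.IGNORECASE)

pages = [
    b'<meta http-equiv="content-type"content="text/html; charset=UTF-8">',
    b'<meta http-equiv="Content-Type" content="text/html, charset=UTF-8" />',
    b'doc.write(\'<body onload=[...] <meta charset="utf-8"/>',
]
encodings = [_charset_re.search(p).group(1) for p in pages]
print(encodings)  # -> [b'UTF-8', b'UTF-8', b'utf-8']
```

Whether such leniency belongs in the prescan regex is exactly the trade-off discussed above: it handles more broken pages, but makes the pattern harder to reason about.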
@starrify I believe the goal was indeed speed; also, these regexes may receive e.g. only the first 4096 bytes of the page, without the rest. Ideas about a proper solution are welcome! It should:
a) be almost as fast as these regexes;
b) work on arbitrarily truncated HTML files.
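On point (b), one cheap adjustment (a sketch under those constraints, not a complete solution) is to treat an unterminated comment in a truncated page as running to the end of the input:

```python
import re

# Sketch: comment removal that also survives truncation, where <!--
# may appear without a closing --> because the page was cut off
# (e.g. only the first 4096 bytes were fetched).
_comment_re = re.compile(r'<!--.*?(?:-->|$)', re.DOTALL)

def strip_comments(text):
    return _comment_re.sub('', text)

assert strip_comments('a<!-- x -->b') == 'ab'
# Truncated page: the unclosed comment is dropped up to end of input.
assert strip_comments('a<!-- <base href="http://example.com/"') == 'a'
```

This stays a single regex substitution, so it should remain close to the speed of the current patterns.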