Planning the rest of the changes to HTML parsing in `PackageFinder`

Okay, I’m looking at the logic now and I think I have a concrete suggestion for how to change things that I’d like feedback from other @pypa/pip-team members on:

- In 22.0.x, change the fallback logic again: drop the stupid starts-with-doctype check and use `html.parser` by default, with the error about doctypes changed to a warning, and with the relevant issue continuing to tell users to pass `--use-deprecated=html5lib` if they are hitting a bug in the parser.
- In 22.2, drop the `--use-deprecated=html5lib` option while continuing to warn about bad doctypes.
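
As a rough illustration of the first step, here is a minimal sketch of parsing an index page with the standard library's `html.parser` and warning, rather than failing, when no doctype is present. This is not pip's actual code: the names `AnchorCollector` and `parse_index_page` are hypothetical, and the check only records whether a doctype appears anywhere in the page.

```python
import warnings
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags and note whether a doctype was seen.

    Hypothetical helper for illustration; not part of pip.
    """

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.saw_doctype = False

    def handle_decl(self, decl):
        # html.parser passes "DOCTYPE html" here for <!DOCTYPE html>.
        if decl.lower().startswith("doctype"):
            self.saw_doctype = True

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def parse_index_page(html_text):
    """Return the anchor hrefs; warn instead of raising on a missing doctype."""
    parser = AnchorCollector()
    parser.feed(html_text)
    parser.close()
    if not parser.saw_doctype:
        warnings.warn("index page has no doctype declaration", stacklevel=2)
    return parser.hrefs
```

With this shape, a page without a doctype still yields its links, and the caller only sees a warning.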


Top GitHub Comments

8 reactions

dstufft commented, Feb 3, 2022

I never would have expected “HTML5” to be the cause of any kind of contention 😉

Specifying HTML5 was primarily just to remove any questions around how to parse the content, and to short-circuit any “well, it works in pip…” logic.

For those who aren’t aware of the history, there was a time when PyPI served pages with some commented-out HTML, because the parsing in setuptools (and maybe pip too, I don’t recall) would break without that. There are also the differences between XHTML and HTML (XHTML requires certain tags to be closed, whereas HTML doesn’t, etc.), and I wanted to cut off any argument about whether an HTML or an XHTML parser was the correct thing to use.
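
To make the XHTML/HTML difference concrete, here is a small sketch using only the Python standard library. It feeds the same page, which contains an unclosed `<br>`, to a strict XML parser and to `html.parser`. The sample page and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical index page with an unclosed <br>, which HTML allows.
PAGE = '<html><body><a href="foo-1.0.tar.gz">foo</a><br></body></html>'


class TagRecorder(HTMLParser):
    """Record the names of all start tags encountered."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def xml_parse_ok(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def html_tags(text):
    """Return the start tags that html.parser reports for the text."""
    recorder = TagRecorder()
    recorder.feed(text)
    recorder.close()
    return recorder.tags
```

Here `xml_parse_ok(PAGE)` is false because the `<br>` is never closed, while `html_tags(PAGE)` still reports every tag, which is the kind of disagreement that fixing the parser type was meant to rule out.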

Largely speaking, it was included to say that you should interpret the responses with an html5 parser, and if a valid html5 parser errors out, then that’s a bad API response.

Which brings us to this issue.

The question about whether or not HTML5 requires a doctype is not a simple one to answer (and honestly, if we really wanted to be strict about this, we should amend PEP 503).

The sections that have been linked are intended to be read by people writing HTML5 pages. However, there’s another section entirely dedicated to how a parser should parse HTML5.

One of the features of HTML5 is not just that it defines how one should parse HTML written as specified in the authoring section, but also that it defines how one should parse HTML that deviates from those guidelines. The entire document is quite long and complex, and I won’t even begin to pretend that I’ve read the entire thing. However, as best as I can tell, section 13.2.6.4 of the parsing document states that if the parser encounters anything but a handful of things at the start of the document (for the purposes of this discussion, white space and a doctype), then that generates a parse error. However, it’s basically up to the parser whether it bails out at that point or continues on, and if it continues on, it will then be in the standards-defined “quirks” mode.
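
The rule described above can be approximated in a few lines. The sketch below is not a conforming HTML5 parser and is not taken from pip. It uses `html.parser` to report whether the first significant token of a page is a doctype, skipping leading white space. It also skips leading comments, which is an assumption about what the spec permits at that point. The names `FirstTokenChecker` and `starts_with_doctype` are hypothetical.

```python
from html.parser import HTMLParser


class FirstTokenChecker(HTMLParser):
    """Remember whether the first significant token was a doctype."""

    def __init__(self):
        super().__init__()
        self.first = None  # becomes "doctype" or "other" once decided

    def _note(self, kind):
        # Only the first significant token counts.
        if self.first is None:
            self.first = kind

    def handle_decl(self, decl):
        is_doctype = decl.lower().startswith("doctype")
        self._note("doctype" if is_doctype else "other")

    def handle_starttag(self, tag, attrs):
        self._note("other")

    def handle_endtag(self, tag):
        self._note("other")

    def handle_data(self, data):
        # Whitespace-only text is ignored; str.strip() is a rough stand-in
        # for the spec's definition of white space.
        if data.strip():
            self._note("other")

    # handle_comment is left as the default no-op, so comments are skipped
    # (assumption: comments are permitted before the doctype).


def starts_with_doctype(html_text):
    """Return True if a doctype precedes any tag or non-blank text."""
    checker = FirstTokenChecker()
    checker.feed(html_text)
    checker.close()
    return checker.first == "doctype"
```

A caller that gets `False` can then choose between the two behaviours described above: bail out, or carry on and treat the page as a quirks-mode document.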

So roughly, PEP 503 doesn’t really specify enough information to make a decision about what PEP 503 actually requires, but it’s certainly within a valid HTML5 parser’s remit to decide to bail out if the doctype isn’t there. That points to the idea that PEP 503 probably does require it to be there. That being said, I think it’s probably not worth getting too worried about a missing doctype.

I suspect that html.parser isn’t actually a fully valid HTML5 parser, given that its documentation calls perfectly valid HTML5 syntax “invalid”. (It mentions that syntax in order to say that it does work, but calling valid syntax invalid does not inspire confidence that it fully implements HTML5 correctly.) If that’s true, then using html.parser technically forces a repository to “guess” what pip supports, rather than being able to use anything HTML5 allows, and the de facto spec ends up being the subset of HTML5 that html.parser parses correctly (or at least, correctly enough for pip to get the data it needs out of it without error).
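
One example of syntax that HTML5 allows and that `html.parser` also handles is an unquoted attribute value. The following minimal sketch shows only that this one case parses. It says nothing about full HTML5 conformance, and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class AttrRecorder(HTMLParser):
    """Record (tag, attrs) pairs for every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


def start_tags(text):
    """Return the start tags and their attributes as html.parser sees them."""
    recorder = AttrRecorder()
    recorder.feed(text)
    recorder.close()
    return recorder.seen
```

For instance, `start_tags('<a href=foo-1.0.tar.gz>foo</a>')` yields the `a` tag with `href` set to `foo-1.0.tar.gz`, the same result as for the quoted form.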

Practically speaking, I think switching to html.parser is fine. I think the chances that someone is doing something particularly exciting in their repository responses that html.parser doesn’t support are pretty small. Likewise, I think not enforcing the doctype’s existence is fine. While HTML5 does require it (sort of, see above), I don’t see a lot of practical benefit in enforcing it (especially given that the change likely breaks some other technically valid edge cases, so it seems weird to be strict here but not elsewhere).

6 reactions

pradyunsg commented, Feb 1, 2022

I’m fine with dropping the doctype checks as well.
