[DEPRECATION] Moving away from html5lib to html.parser
Starting with pip 22.0, HTML parsing is done using html.parser instead of html5lib by default. Along with this, there’s an additional check to ensure that a valid HTML 5 doctype declaration is present in the document.
If you’re here from a warning/error from pip’s output:
- Please reach out to the provider of the package index you’re using and ask them to change the index pages to be valid HTML 5 documents (declaring doctype, having the correct structure etc).
- You may pass --use-deprecated=html5lib until pip 22.2 (i.e. start of Q3 2022), when this flag will be dropped. This will suppress the warning for now; however, you will no longer be able to pass this flag once pip 22.2 is released (and will need to fix the index pages to suppress the warning).
This behaviour change is motivated by two major factors:
- html5lib is the reason that pip pulls in various other libraries as part of its own dependency graph. Dropping html5lib and its dependencies from pip reduces the maintenance workload on pip’s maintainers and helps reduce the size of pip’s distributions.
- The Python standard library’s html.parser is more than sufficient for parsing the pages that pip needs to parse (see https://pypi.org/simple/pip/ for example).
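To illustrate the second point, here is a minimal sketch of pulling the links out of a simple index page with nothing but the standard library. It is not pip's actual implementation: the `AnchorCollector` class, the `collect_links` helper, and the sample page with its `example.invalid` URLs are all made up for this example.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element in a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None  # href of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical index page, shaped like a simple package index listing.
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
  <body>
    <a href="https://example.invalid/pip-22.0.tar.gz#sha256=abc">pip-22.0.tar.gz</a>
    <a href="https://example.invalid/pip-22.0-py3-none-any.whl#sha256=def">pip-22.0-py3-none-any.whl</a>
  </body>
</html>
"""


def collect_links(page):
    """Return the (href, text) pairs found in the given HTML text."""
    parser = AnchorCollector()
    parser.feed(page)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for href, text in collect_links(SAMPLE_PAGE):
        print(text, "->", href)
```

An index page is essentially a flat list of anchors, so a small event-driven parser like this covers it without a full HTML 5 tree builder.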
~Barring major surprises, the flag to use html5lib will be removed in 22.1.~ There were surprises.
- The initial implementation of the html.parser-based parsing enforced that the page contains a doctype, throwing an error if it did not. Turns out, many third-party package indexes did not include a <!doctype html> in their index pages.
- With pip 22.0.1, certain bugs in the fallback logic were fixed, for pages that did not include the doctype.
- With pip 22.0.2, a fallback to the legacy html5lib logic was introduced, for pages that don’t start with <!doctype html> (case-insensitive), with a warning presented to the user.
- With pip 22.0.3, the fallback to the legacy html5lib logic has been removed and the strict error in the html.parser logic has been relaxed to be a warning.
- With pip 22.0.4, the warning has been removed. Users will no longer get a warning on an invalid or missing doctype. However, this should still be fixed since a future version of pip may start rejecting such pages (after a deprecation period of ~3-6 months).
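If you maintain an index, a check along these lines may help catch a missing doctype before your users do. This is only a sketch of the "warn instead of error" behaviour described above, not pip's code: the function names are hypothetical, and tolerating leading whitespace before the doctype is an assumption of this example rather than something stated here about pip.

```python
import warnings


def has_html5_doctype(page: str) -> bool:
    """Return True if the page starts with <!doctype html>, case-insensitively.

    Assumption for this sketch: leading whitespace before the doctype is ignored.
    """
    return page.lstrip().lower().startswith("<!doctype html>")


def check_index_page(page: str, url: str) -> bool:
    """Warn (rather than raise) when an index page lacks the HTML 5 doctype."""
    if has_html5_doctype(page):
        return True
    warnings.warn(
        f"{url} does not start with <!doctype html>; "
        "a future version of pip may reject such pages",
        stacklevel=2,
    )
    return False


if __name__ == "__main__":
    # Hypothetical pages and URLs, for illustration only.
    print(check_index_page("<!DOCTYPE HTML>\n<html></html>", "https://example.invalid/simple/a/"))
    print(check_index_page("<html></html>", "https://example.invalid/simple/b/"))
```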
Issue Analytics
- State:
- Created 2 years ago
- Reactions:24
- Comments:131 (61 by maintainers)
Top GitHub Comments
Alright. Let’s start with what happened here:
- […] DOCTYPE html according to the HTML 5 spec. https://pypi.org follows this.
- […] --use-deprecated=html5lib from working in certain configurations.
- […] --use-deprecated=html5lib work and handling doctype declarations case-insensitively.

Now, the expectation was that this would be a minor annoyance, that very few index servers in the wild would not be following the standards. This turned out to be absurdly wrong. Sure, we’ve got a couple of bugs in the logic and don’t treat certain valid HTML 5 documents as such. Heck, we’re still not checking anything around HTML 5 strictly, except for the doctype. The goal wasn’t “everything needs to strictly be HTML 5”, but neither was it to break a substantial part of our userbase.
BUT… OH BOY, so many of you here are corporate users that seem to not understand how to maintain a healthy relationship with open source projects. No, it is in fact not our responsibility to ensure that your systems keep functioning. It’s your job to do that. No, it’s not my problem that the commercial vendors that you pay thousands of dollars to didn’t follow the standards that explicitly noted how things should work. It’s their job to do that. Some of you got paid for the time you spent because of this issue, I didn’t.
Gosh, if you’re going to argue that you wanted a quicker fix, go talk to people you pay thousands of dollars to, for the services they provide you and ask them why they didn’t help you with this quickly. Stop arguing that I didn’t do something for you, for free, at a quicker pace – you’re not making the point you think you’re making.
I appreciate that users here have noted that yanking releases was a possible option. I’m aware: I helped implement it in pip as well as in PyPI. I’d stated that I don’t think the breakage is widespread enough to do that, and I still don’t think it was. Preparing and making a bugfix release was a much better investment of my time than dealing with the work of yanking a release and going through to undo all the other aspects of our release process that the various bits of automation trigger (newer docs, get-pip.py release reverting, communication around this via our regular release communication channels etc). Sure, you’d only care about the yanking but this project isn’t just a publish-on-PyPI project.
Am I happy that this was as disruptive as this ended up being? No. It negatively affected other community efforts like piwheels, scipy-wheels etc. It ate into pip’s churn budget for very little value and likely paged many people on a Sunday (trust me, I know, it sucks). It gobbled up the one sunny Sunday afternoon I had after quite a while.
Could something have been done to avoid this or reduce the pain here “more quickly”? Sure, at the cost of pulling back significant usability improvements that this release contained, some level of user confusion and more. I don’t think that’s worth it – see comments above as well as the paragraph above.
Could the pip team have been more gradual about rolling out this change? Of course. We could’ve done our regular dance of opt-in, flip-default with an opt-out; like we did for the resolver rollout. It was assumed that it wasn’t worth it in this case, and we were wrong. I was wrong to think Hyrum’s Law won’t apply here while merging the relevant PR, but… yea, it was. Lesson learnt, at the cost of a Sunday and a decent amount of pip’s churn budget. 😃
So… anyway:
There’s many vocal users here who IMO don’t understand the relationship between open source and corporate ecosystems. Upstream maintainers are, in fact, not responsible for keeping your company’s workflows functional – those employed by your company are. And the job involves insulating from potential breakages that upstream maintainers introduce. And, yes, investing in the open source projects and sponsoring them can make it less likely that you’d see such disruption – both because the projects now actually have incentives to keep a healthy relationship with you and because it also makes it more likely that the project is in a more sustainable state wherein it can do the sort of early testing that would catch issues like this.
If you’re a provider of enterprise index server solutions, please look into how you can prevent such breakage from happening again for your users. I’d especially encourage you to explore actively testing against the development version of pip to catch these issues early. If you want to help with pip’s development and bring it to a more sustainable state, I’m sure that these folks would be happy to talk to you about how you can help.
If you’re a user of such enterprise solutions, go talk to the vendor you regularly pay thousands of dollars, and ask them why they had not invested in preventing such breakages and protecting you from such breakage and – more importantly – what they’re going to do going forward to help prevent this. And, if you want to help with pip’s development and bring it to a more sustainable state so that you don’t have to rely on me being nice and cutting multiple bugfixes on a weekend, I’m sure that these folks would be happy to talk to you about how you can help.
I’m unlocking this issue now. I’ll remind everyone that the PSF’s CoC applies here – be open, considerate, and respectful.
Oh, and lemme say something that I just realised I hadn’t already…
To everyone who’s been polite, reached out via other means, said thanks here or elsewhere, or in other ways shown appreciation and understanding toward the situation here — Thank you! I appreciate the kind words and understanding, and I’m sure my fellow maintainers do as well.