[DEPRECATION] Moving away from html5lib to html.parser

Starting with pip 22.0, the HTML parsing is done using html.parser instead of html5lib by default. Along with this, there’s an additional check to ensure that a valid HTML 5 doctype declaration is present in the document.

If you’re here from a warning/error from pip’s output:

  • Please reach out to the provider of the package index you’re using and ask them to change the index pages to be valid HTML 5 documents (declaring doctype, having the correct structure etc).
  • You may pass --use-deprecated=html5lib until pip 22.2 (i.e. start of Q3 2022), when this flag will be dropped. This will suppress the warning for now, however you will no longer be able to pass this flag once pip 22.2 is released (and will need to fix the index pages to suppress the warning).
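As a rough illustration of the temporary workaround above, the sketch below builds a pip command line that opts back into the legacy parser. The helper name `pip_install_command`, the package name, and the index URL are hypothetical placeholders; only `--use-deprecated=html5lib` comes from the issue text, and per that text it is only expected to be accepted by pip releases before 22.2.

```python
import sys


def pip_install_command(package, index_url, use_legacy_parser=False):
    """Build (but do not run) a pip install command as an argument list.

    `package` and `index_url` are caller-supplied; nothing here contacts
    the network or invokes pip.
    """
    cmd = [sys.executable, "-m", "pip", "install", "--index-url", index_url]
    if use_legacy_parser:
        # Temporary opt-out described in the issue; expected to be
        # rejected once the flag is dropped from pip.
        cmd.append("--use-deprecated=html5lib")
    cmd.append(package)
    return cmd


# Hypothetical values, for illustration only.
EXAMPLE_COMMAND = pip_install_command(
    "demo",
    "https://index.example.invalid/simple/",
    use_legacy_parser=True,
)
```

The resulting list can be handed to `subprocess.run(..., check=True)`. Keeping the flag behind a boolean makes it easy to switch off once the index pages have been fixed.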

This behaviour change is motivated by two major factors:

  • html5lib is the reason that pip pulls in various other libraries, as part of its own dependency graph. Dropping html5lib and its dependencies from pip reduces the maintenance workload on pip’s maintainers and helps reduce the size of pip’s distributions.
  • The Python standard library’s html.parser is more than sufficient for parsing the pages that pip needs to parse (see https://pypi.org/simple/pip/ for example).
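To give a sense of why the standard library is enough for this kind of page, here is a minimal sketch that collects anchor links from a simple index-style listing using `html.parser`. It is not pip's actual implementation; the class name `AnchorCollector`, the helper `collect_links`, and the sample page with its file URLs are all made up for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []            # finished (href, text) tuples
        self._current_href = None  # href of the <a> being read, if any
        self._current_text = []    # text fragments inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._current_text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._current_text).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def collect_links(page):
    """Return the list of (href, text) pairs found in `page`."""
    parser = AnchorCollector()
    parser.feed(page)
    parser.close()
    return parser.links


# Made-up page in the style of a simple index listing.
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
  <body>
    <a href="https://files.example.invalid/demo-1.0.tar.gz">demo-1.0.tar.gz</a>
    <a href="https://files.example.invalid/demo-1.0-py3-none-any.whl">demo-1.0-py3-none-any.whl</a>
  </body>
</html>
"""
```

Calling `collect_links(SAMPLE_PAGE)` yields one tuple per anchor. A real index client would also need to resolve relative URLs and read attributes beyond `href`, which this sketch leaves out.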

~Barring major surprises, the flag to use html5lib will be removed in 22.1.~ There were surprises.

  • The initial implementation of the html.parser-based parsing enforced that the page contains a doctype, throwing an error if it did not. Turns out, many third-party package indexes did not include a <!doctype html> in their index pages.
  • With pip 22.0.1, certain bugs in the fallback logic were fixed, for pages that did not include the doctype.
  • With pip 22.0.2, a fallback to the legacy html5lib logic was introduced, for pages that don’t start with <!doctype html> (case-insensitive) with a warning presented to the user.
  • With pip 22.0.3, the fallback to the legacy html5lib logic has been removed and the strict error in the html.parser logic has been relaxed to be a warning.
  • With pip 22.0.4, the warning has been removed. Users will no longer get a warning on an invalid or missing doctype. However, this should still be fixed since a future version of pip may start rejecting such pages (after a deprecation period of ~3-6 months).
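For index maintainers who want to check their own pages, the doctype requirement described in the list above can be approximated with a short `html.parser` sketch. This is an assumption-laden illustration rather than pip's code: the names `DoctypeChecker` and `has_html5_doctype` are hypothetical, and the rule implemented here (a case-insensitive `doctype html` declaration appearing before any element) is only one reasonable reading of "starts with `<!doctype html>`".

```python
from html.parser import HTMLParser


class DoctypeChecker(HTMLParser):
    """Record whether a page declares the HTML 5 doctype before any tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.has_html5_doctype = False
        self._seen_markup = False  # True once any declaration or tag is seen

    def handle_decl(self, decl):
        # `decl` is the text inside <!...>, for example "DOCTYPE html".
        normalized = " ".join(decl.split()).lower()
        if not self._seen_markup and normalized == "doctype html":
            self.has_html5_doctype = True
        self._seen_markup = True

    def handle_starttag(self, tag, attrs):
        # A doctype that shows up after the first element does not count.
        self._seen_markup = True


def has_html5_doctype(page):
    """Return True if `page` opens with an HTML 5 doctype declaration."""
    checker = DoctypeChecker()
    checker.feed(page)
    checker.close()
    return checker.has_html5_doctype
```

A page generator's test suite could assert `has_html5_doctype(rendered_page)` for each index page, so a missing or legacy doctype is caught before a future pip release starts rejecting it.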

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 24
  • Comments: 131 (61 by maintainers)

Top GitHub Comments

122 reactions
pradyunsg commented, Jan 31, 2022

Alright. Let’s start with what happened here:

  • 22.0 released:
    • new default HTML parsing logic, that enforced the HTML doctype declaration to be present and be DOCTYPE html according to the HTML 5 spec. https://pypi.org follows this.
    • a bug that prevented --use-deprecated=html5lib from working in certain configurations.
  • 22.0.1 fixed the aforementioned bug, making --use-deprecated=html5lib work and handling doctype declarations case-insensitively.
  • It became increasingly clear that multiple major package indexes (especially commercial ones) were affected.
  • 22.0.2 added (somewhat questionable) fallback logic, to deal with the immediate breakage caused by this change.

Now, the expectation was that this would be a minor annoyance, that very few index servers in the wild would not be following the standards. This turned out to be absurdly wrong. Sure, we’ve got a couple of bugs in the logic and don’t treat certain valid HTML 5 documents as such. Heck, we’re still not checking anything around HTML 5 strictly, except for the doctype. The goal wasn’t “everything needs to strictly be HTML 5”, but neither was it to break a substantial part of our userbase.

BUT… OH BOY, so many of you here are corporate users that seem to not understand how to maintain a healthy relationship with open source projects. No, it is in fact not our responsibility to ensure that your systems keep functioning. It’s your job to do that. No, it’s not my problem that the commercial vendors that you pay thousands of dollars to didn’t follow the standards that explicitly noted how things should work. It’s their job to do that. Some of you got paid for the time you spent because of this issue, I didn’t.

Gosh, if you’re going to argue that you wanted a quicker fix, go talk to people you pay thousands of dollars to, for the services they provide you and ask them why they didn’t help you with this quickly. Stop arguing that I didn’t do something for you, for free, at a quicker pace – you’re not making the point you think you’re making.

I appreciate that users here have noted that yanking releases was a possible option. I’m aware: I helped implement it in pip as well as in PyPI. I’d stated that I don’t think the breakage is widespread enough to do that, and I still don’t think it was. Preparing and making a bugfix release was a much better investment of my time than dealing with the work of yanking a release and going through to undo all the other aspects of our release process that the various bits of automation trigger (newer docs, get-pip.py release reverting, communication around this via our regular release communication channels etc). Sure, you’d only care about the yanking but this project isn’t just a publish-on-PyPI project.


Am I happy that this was as disruptive as this ended up being? No. It negatively affected other community efforts like piwheels, scipy-wheels etc. It ate into pip’s churn budget for very little value and likely paged many people on a Sunday (trust me, I know, it sucks). It gobbled up the one sunny Sunday afternoon I had after quite a while.

Could something have been done to avoid this or reduce the pain here “more quickly”? Sure, at the cost of pulling back significant usability improvements that this release contained, some level of user confusion and more. I don’t think that’s worth it – see comments above as well as the paragraph above.

Could the pip team have been more gradual about rolling out this change? Of course. We could’ve done our regular dance of opt-in, flip-default with an opt-out; like we did for the resolver rollout. It was assumed that it wasn’t worth it in this case, and we were wrong. I was wrong to think Hyrum’s Law won’t apply here while merging the relevant PR, but… yea, it was. Lesson learnt, at the cost of a Sunday and a decent amount of pip’s churn budget. 😃

So… anyway:

  • There’s many vocal users here who IMO don’t understand the relationship between open source and corporate ecosystems. Upstream maintainers are, in fact, not responsible for keeping your company’s workflows functional – those employed by your company are. And the job involves insulating from potential breakages that upstream maintainers introduce. And, yes, investing in the open source projects and sponsoring them can make it less likely that you’d see such disruption – both because the projects now actually have incentives to keep a healthy relationship with you and because it also makes it more likely that the project is in a more sustainable state wherein it can do the sort of early testing that would catch issues like this.

  • If you’re a provider of enterprise index server solutions, please look into how you can prevent such breakage from happening again for your users. I’d especially encourage you to explore actively testing against the development version of pip to catch these issues early. If you want to help with pip’s development and bring it to a more sustainable state, I’m sure that these folks would be happy to talk to you about how you can help.

  • If you’re a user of such enterprise solutions, go talk to the vendor you regularly pay thousands of dollars, and ask them why they had not invested in preventing such breakages and protecting you from such breakage and – more importantly – what they’re going to do going forward to help prevent this. And, if you want to help with pip’s development and bring it to a more sustainable state so that you don’t have to rely on me being nice and cutting multiple bugfixes on a weekend, I’m sure that these folks would be happy to talk to you about how you can help.

I’m unlocking this issue now. I’ll remind everyone that the PSF’s CoC applies here – be open, considerate, and respectful.

47 reactions
pradyunsg commented, Feb 1, 2022

Oh, and lemme say something that I just realised I hadn’t already…

To everyone who’s been polite, reached out via other means, said thanks here or elsewhere, or in other ways shown appreciation and understanding toward the situation here — Thank you! I appreciate the kind words and understanding, and I’m sure my fellow maintainers do as well.
