Proposal for breaking change: removing the `root` element
Background
Cheerio’s `load` method accepts a string of markup and creates a selector
function. This function is “bound” to a document whose contents are a node
structure based on the input markup. This function is intended to behave much
like the global `jQuery`/`$` function provided by the jQuery library (as if the
library had been loaded in a document generated from the same markup by a web
browser).
Historically, Cheerio has always attached the parsed markup to a non-standard
“root” element (i.e. `<root>`). As far as I know, this was implemented to
support the `load` method’s behavior when given markup describing a document
fragment: strings like `'<p>1</p><p>2</p>'` could be passed to `load` and still
produce a single top-level element.
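To make the role of the wrapper concrete, here is a toy model in plain JavaScript. It is not Cheerio’s actual implementation: the `makeElement` and `wrapInRoot` helpers and the node shape are assumptions invented for illustration. It only shows how a synthetic `root` element gives several top-level nodes a single parent.

```javascript
// Toy model only: these helpers and the node shape are hypothetical,
// not Cheerio's internal data structures.
function makeElement(name, children = []) {
  const node = { type: 'tag', name, children, parent: null };
  for (const child of children) {
    child.parent = node;
  }
  return node;
}

// Attach every top-level node of a parsed fragment to one synthetic parent.
function wrapInRoot(topLevelNodes) {
  return makeElement('root', topLevelNodes);
}

// Stand-in for the parse result of '<p>1</p><p>2</p>': two sibling elements.
const fragment = [makeElement('p'), makeElement('p')];
const root = wrapInRoot(fragment);

console.log(root.name);            // root
console.log(root.children.length); // 2
console.log(fragment.every((node) => node.parent === root)); // true
```

With a single parent in place, traversal code can always start from one node, regardless of how many top-level elements the input contained.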
Thanks to @inikulin and his Parse5 library, the release candidate for version
1.0.0 normalizes the parsing behavior of `load`. It always produces a
complete document, just like a web browser. The result is much more
predictable, standards-compliant, and “familiar” to web developers. It’s a
backwards-breaking change, though, and we hope to ease the upgrade path for
consumers through a concise migration guide.
Proposal
Since we are already committed to a breaking change for version 1.0, I wanted
to consider making Cheerio’s behavior even more browser-like. I’d like to get
rid of the `<root>` element and instead rely on the `<html>` element (which,
again, is either described by the consumer’s input markup or automatically
created by Parse5).
This would involve removing the Cheerio-specific `$.root()` method from the
API. Users who previously used it as a basis for traversal could rewrite code
like `$.root().find('div')` as `$('html').find('div')`.
However, many use cases involve rendering full documents. For this,
`$('html').html()` is not equivalent because the resultant string does not
include the document element itself. So we’ll need to keep another
Cheerio-specific method: `$.html()`.
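The difference between the two kinds of rendering can be sketched with a toy serializer in plain JavaScript. This is an illustration under assumptions, not Cheerio’s code: the node shape and the `serializeInner` and `serializeOuter` helpers are hypothetical, and text escaping and attributes are ignored. `serializeInner` plays the role of an `.html()` call on the `<html>` element, while `serializeOuter` shows what a full-document rendering has to include.

```javascript
// Toy model only: a hypothetical node shape and serializer, not Cheerio's code.
// Serialize a node together with its own opening and closing tags.
function serializeOuter(node) {
  if (node.type === 'text') {
    return node.data;
  }
  return `<${node.name}>${serializeInner(node)}</${node.name}>`;
}

// Serialize only the children, the way an innerHTML-style getter does.
function serializeInner(node) {
  return node.children.map(serializeOuter).join('');
}

const html = {
  type: 'tag',
  name: 'html',
  children: [
    { type: 'tag', name: 'head', children: [] },
    {
      type: 'tag',
      name: 'body',
      children: [
        { type: 'tag', name: 'p', children: [{ type: 'text', data: '1' }] },
      ],
    },
  ],
};

console.log(serializeInner(html)); // <head></head><body><p>1</p></body>
console.log(serializeOuter(html)); // <html><head></head><body><p>1</p></body></html>
```

The inner form drops the `<html>` tags, which is why a selection-level `.html()` call cannot stand in for a document-level rendering method.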
Other than that, I think it would just be a matter of updating Cheerio’s
internals to operate without `<root>`. I spent just enough time trying this out
to see that it is not trivial, but I don’t believe that there is any technical
reason it can’t be done.
I’d love to get feedback from any Cheerio user, but I’m particularly interested in hearing from @fb55, @matthewmueller, and @inikulin. Do you think this is a good idea? Do you think it would invalidate any existing use cases? Or is there any other reason it isn’t technically possible? Or more subjectively, do you think the change would be too jarring for consumers to justify the benefit?
Top GitHub Comments
Thanks for the feedback, @inikulin! I wanted to push for a more standards-compliant API if only to make the internals more familiar for contributors. After experimenting some more, though, I’ve come to realize that whatever the case, jQuery (and by extension Cheerio) does not offer an API for working with the owner document.
One day, it would be nice if we could more concretely document and support direct interaction with Cheerio’s DOM. I’m not sure if this is a realistic goal, though: if the DOM is implemented as a static data structure, then it will always be dangerous to encourage end-user manipulation–many modifications could invalidate the document and cause instability in Cheerio’s behavior.
So until then, we’ll need a Cheerio-specific API to support this use case. We might make changes to the underlying structure (for instance, using a “true” document node as opposed to an Element with tag name “root”), but that will be an implementation detail that we introduce essentially just for the sake of conformance; it won’t affect the API. In other words: I don’t have to block version 1.0 with my pedantic objections 😃
@jugglinmike does this mean that some form of `.root()` will end up in v1?