XML Declaration ignored by DOMParser

Basic info:

Node.js version: v10.15.3
jsdom version: 15.1.1

Minimal reproduction case

const { JSDOM } = require("jsdom");
const { XMLSerializer } = require("w3c-xmlserializer");

const inputXml = `<?xml version="1.0" encoding="ASCII"?><test>hello</test>`;
const options = { contentType: "application/xml" };
const dom = new JSDOM(inputXml, options);
const XMLSerializer_ctor = XMLSerializer.interface;
const serializer = new XMLSerializer_ctor();
const outputXml = serializer.serializeToString(dom.window.document);

console.log("inputXml:");
console.log(inputXml);
console.log("outputXml:");
console.log(outputXml); // Expected this to match inputXml.

I realize this example depends on w3c-xmlserializer, but I don’t see any other way to demonstrate the full round trip without it, since the serialization is not done by jsdom itself.
I had initially logged a bug about this in the saxes project, but they seemed to think DOMParser needs to retrieve the XML Declaration details from xmlDecl. (saxes issue #16)
This may sound like a restatement of #415, but the fix for that issue did not address the original problem as described. It allowed Processing Instructions to be parsed, but not the XML Declaration itself, which was what the bug was about. Crucially, saxes does not emit the onprocessinginstruction event for the XML Declaration, only for other Processing Instructions.
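
As long as the declaration is dropped during parsing, one possible workaround is to capture it from the input string and re-attach it to the serialized output. The following is a minimal sketch under stated assumptions, not how jsdom, saxes, or w3c-xmlserializer behave: extractXmlDeclaration and withDeclaration are hypothetical helper names, the regular expression only handles a declaration at the very start of the string (no byte order mark), and the literal "<test>hello</test>" stands in for the serializer output.

```javascript
// Hypothetical helpers; not part of jsdom, saxes, or w3c-xmlserializer.

// Matches an XML declaration at the very start of a string, for example
// <?xml version="1.0" encoding="ASCII"?>
const XML_DECL_RE =
  /^<\?xml\s+version\s*=\s*(["'])[^"']*\1(?:\s+encoding\s*=\s*(["'])([^"']*)\2)?(?:\s+standalone\s*=\s*(["'])(?:yes|no)\4)?\s*\?>/;

// Returns the declaration text and its declared encoding, or null if the
// input does not start with an XML declaration.
function extractXmlDeclaration(xml) {
  const match = XML_DECL_RE.exec(xml);
  if (!match) return null;
  return { text: match[0], encoding: match[3] || null };
}

// Prepends the captured declaration unless the serialized string already
// starts with one.
function withDeclaration(serialized, declaration) {
  if (!declaration || XML_DECL_RE.test(serialized)) return serialized;
  return declaration.text + serialized;
}

const input = `<?xml version="1.0" encoding="ASCII"?><test>hello</test>`;
const decl = extractXmlDeclaration(input);
// Assumption: the serializer returned the document without its declaration.
const output = withDeclaration("<test>hello</test>", decl);
console.log(output);
```

This only restores the text of the declaration. It does not check that the declared encoding matches the bytes that are eventually written out.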

How does similar code behave in browsers?

Example in jsbin

Firefox produces an XML Declaration, but the encoding is changed to "UTF-8".
Chrome produces an XML Declaration that matches the original.
IE does not produce an XML Declaration.
Edge does not produce an XML Declaration.

Issue Analytics

Created 4 years ago
Comments: 11 (5 by maintainers)

Top GitHub Comments

Sebmaster commented, Jun 21, 2019

> I would like to ensure that the encoding that I specify is correct.

I’ve been thinking about this. When you serialise the doc, you just get back a JS string, which (I think) means it should be encoding agnostic.

I think what you write into the XML declaration depends on how you write the file. If you use fs.writeFile and specify the string without an encoding, the file will always end up with utf-8 encoding.
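
To illustrate the point about writing, here is a small sketch using only Node's built-in modules. It writes the same document once with the default encoding and once as "latin1", adjusting the declaration to match in the second case. The file names and the sample text are made up for the example, and the string replace on the declaration is a simplification rather than a recommended way to edit XML.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// A JS string has no byte encoding of its own; \u00e9 is "e" with an acute accent.
const xml = `<?xml version="1.0" encoding="UTF-8"?><test>h\u00e9llo</test>`;

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "xml-encoding-"));
const utf8Path = path.join(dir, "utf8.xml");
const latin1Path = path.join(dir, "latin1.xml");

// No encoding given: Node writes strings as UTF-8.
fs.writeFileSync(utf8Path, xml);

// Explicit encoding: the declaration is changed to agree with the bytes.
const latin1Xml = xml.replace('encoding="UTF-8"', 'encoding="ISO-8859-1"');
fs.writeFileSync(latin1Path, latin1Xml, "latin1");

const utf8Bytes = fs.readFileSync(utf8Path);
const latin1Bytes = fs.readFileSync(latin1Path);

// The accented character is two bytes (c3 a9) in UTF-8 and one byte (e9) in Latin-1.
console.log(utf8Bytes.includes(Buffer.from([0xc3, 0xa9])));
console.log(latin1Bytes.includes(0xe9));

// Clean up the temporary directory (fs.rmSync needs Node 14.14 or later).
fs.rmSync(dir, { recursive: true, force: true });
```

The same string produces different bytes depending on the encoding passed at write time, which is why the declaration has to be chosen at that step rather than taken from the serializer.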

However, it seems like jsdom always does the HTML encoding sniffing, even for XML docs. I’m not sure if that’s intended or if there’s a bug lurking there, but that could lead to double decodes 🤷‍♂ Definitely room to test that better there.

domenic commented, Jun 21, 2019

We have discussions about this in the context of HTML parsing/serialization all the time. In short, serialization and parsing are not meant to preserve the original form of the document. They preserve only the original information (i.e., the abstract stuff that survives into the parsed form). (And sometimes, not even that; see the warning and examples below the algorithm at https://html.spec.whatwg.org/#serialising-html-fragments). See https://github.com/inikulin/parse5/issues/261#issuecomment-401389295 for more.

> As it currently stands, if I wanted to do the latter option myself, would it be safe to use dom.window.document.inputEncoding to detect the encoding that is in use?

I’m not sure, as I’m not sure what definition of “safety” you’re using. But see https://dom.spec.whatwg.org/#dom-document-inputencoding for the definition of inputEncoding.