Make Get-Content recognize XML file encodings
Issue
The commonly-used pattern for reading XML files goes like this:
$doc = [xml](Get-Content xyz.xml)
It’s fast, convenient, ubiquitous and … wrong.
It does not pay attention to XML file encodings and happily (and silently) mangles data from XML files that do not happen to be in the expected encoding (i.e. UTF-8), or in an encoding that mandates a BOM (i.e. UTF-16 and friends). See #14505 and #14404 for more in-depth discussion.
Sadly this usage pattern is not going anywhere. Far too many examples all over the Internet show wrong usage, far too few people are aware of - or care about - how XML implements encodings.
The ideal solution would be to teach people how to do it correctly; the pragmatic solution is to make the pattern above do the right thing.
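For comparison, here is a minimal sketch of the encoding-aware way to load a file: hand the path to the .Load() method of System.Xml.XmlDocument, which reads the BOM or the XML declaration itself. The temp file name and its contents are made up for the demo; nothing here is taken from the linked issues.

```powershell
# Demo data: a BOM-less ISO-8859-1 file that declares its encoding.
# [char]0xE9 is "é"; building it this way keeps the script itself pure ASCII.
$path   = Join-Path ([System.IO.Path]::GetTempPath()) 'encoding-demo.xml'
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')
$xml    = '<?xml version="1.0" encoding="iso-8859-1"?><r>caf' + [char]0xE9 + '</r>'
[System.IO.File]::WriteAllText($path, $xml, $latin1)

# Let the XML parser pick the encoding from the BOM or the XML declaration.
$doc = [System.Xml.XmlDocument]::new()
# .Load() resolves relative paths against the process working directory,
# not the current PowerShell location, so pass a full path.
$doc.Load($path)

# The text content of <r>, decoded according to the declared encoding.
$doc.r
```

For a file relative to the current location, something like Convert-Path -LiteralPath xyz.xml produces a full path that can be passed to .Load().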
Proposal
Make Get-Content recognize XML files and switch file encodings on-the-fly. This is what XML parsers do as well.
It’s easy, since the first line of the file will contain the XML declaration <?xml version="..." encoding="..."?> which has the necessary encoding information.
In the absence of an XML declaration, assuming UTF-8, i.e. exactly what Get-Content normally does, is correct.
Proposed technical implementation details
An implementation for Get-Content encoding sniffing should go like this:
- if there’s a BOM, use that (that’s already happening)
- otherwise, if there is an -Encoding parameter, use that
- otherwise, look at the first couple of bytes of the file as ASCII
  - if it starts with an XML declaration <?xml, parse out the encoding="..." value
    - if it’s a recognized encoding name, switch the file stream to a new encoding for the rest of the file
    - otherwise, default to UTF-8
  - otherwise, assume plain text in UTF-8 (that’s already happening)
    - not sure if viable, but nice to have: In the “UTF-8 assumed” case, if there are decode errors, rewind stream & fall back to the system’s default ANSI encoding (might already happen, I haven’t checked)
Actual implementation details can be derived from the way System.Xml.XmlDocument implements it in its .Load() method, I’m sure there are some corner cases.
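The sniffing order described above can be sketched roughly as follows. This is an illustration under stated assumptions, not how Get-Content or XmlDocument actually implement it: the function name Get-SniffedEncoding, the 256-byte look-ahead and the regular expression are all hypothetical, and the $Encoding parameter merely stands in for Get-Content's -Encoding. The optional ANSI fallback on decode errors is left out.

```powershell
function Get-SniffedEncoding {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [System.Text.Encoding] $Encoding   # stand-in for an explicit -Encoding argument
    )

    # Read only the head of the file; an XML declaration must sit at the very start.
    # 256 bytes is an arbitrary limit chosen for this sketch.
    $buffer = [byte[]]::new(256)
    $stream = [System.IO.File]::OpenRead((Convert-Path -LiteralPath $Path))
    try     { $count = $stream.Read($buffer, 0, $buffer.Length) }
    finally { $stream.Dispose() }

    # 1. A BOM wins. The UTF-32 checks come before UTF-16 because the
    #    UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM.
    $hex = [System.BitConverter]::ToString($buffer, 0, [Math]::Min($count, 4))
    if ($hex.StartsWith('EF-BB-BF'))    { return [System.Text.Encoding]::UTF8 }
    if ($hex.StartsWith('FF-FE-00-00')) { return [System.Text.Encoding]::UTF32 }
    if ($hex.StartsWith('00-00-FE-FF')) { return [System.Text.UTF32Encoding]::new($true, $true) }
    if ($hex.StartsWith('FF-FE'))       { return [System.Text.Encoding]::Unicode }
    if ($hex.StartsWith('FE-FF'))       { return [System.Text.Encoding]::BigEndianUnicode }

    # 2. Otherwise an explicitly requested encoding is used.
    if ($null -ne $Encoding) { return $Encoding }

    # 3. Otherwise look at the head as ASCII and search for an XML declaration.
    $head = [System.Text.Encoding]::ASCII.GetString($buffer, 0, $count)
    if ($head -cmatch '^<\?xml\s[^>]*?\bencoding\s*=\s*["'']([A-Za-z][A-Za-z0-9._-]*)["'']') {
        try   { return [System.Text.Encoding]::GetEncoding($Matches[1]) }
        catch { return [System.Text.Encoding]::UTF8 }   # unrecognized encoding name
    }

    # 4. No declaration: plain text, assumed to be UTF-8.
    return [System.Text.Encoding]::UTF8
}
```

The returned encoding could then be handed to a reader, for example [System.IO.File]::ReadAllText($fullPath, $enc). One of the corner cases this sketch ignores is UTF-16 without a BOM, where the declaration is not readable as ASCII.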
Benefits
- It would make things transparently correct for existing scripts without breaking any of them.
- It would make things transparently correct for anyone copying code off of the Internet/who’s not deep enough into the details of how XML implements file encodings.
- All the thousands of bad examples from the Internet would be fixed in one sweep.
- It would help people who naively (or for performance reasons) process XML data line-wise as plain text.
Issue Analytics
- Created 3 years ago
- Reactions: 5
- Comments: 8 (3 by maintainers)
Top GitHub Comments
From the WG conclusion it is not clear what the new suggestion to resolve the issue is.
@mklement0 Yeah, I don’t feel strongly enough about this to make yet another proposal out of it. It’s a messy topic and there is no way my assumptions aren’t biased. The more complex this gets, the higher the chance to introduce unpredictable behavior or new bugs.