Make Get-Content recognize XML file encodings

Issue

The commonly-used pattern for reading XML files goes like this:

$doc = [xml](Get-Content xyz.xml)

It’s fast, convenient, ubiquitous and … wrong.

It ignores the encoding declared in the XML file and silently mangles data from any file that is neither in the encoding Get-Content expects (UTF-8) nor in an encoding that mandates a BOM (UTF-16 and friends). See #14505 and #14404 for more in-depth discussion.
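
A minimal sketch of the problem, assuming PowerShell 7+ (where Get-Content reads BOM-less files as UTF-8). The file name and its content are made up for this example. On that version the final comparison is expected to come out False, because the byte for "é" in ISO-8859-1 is not valid UTF-8.

```powershell
# Hypothetical sample file: ISO-8859-1 content with a matching XML declaration.
$path     = Join-Path ([System.IO.Path]::GetTempPath()) 'mangle-demo.xml'
$expected = 'caf' + [char]0xE9   # "café", built from a code point so the script's own encoding does not matter
$content  = '<?xml version="1.0" encoding="iso-8859-1"?><root>' + $expected + '</root>'
[System.IO.File]::WriteAllText($path, $content, [System.Text.Encoding]::GetEncoding('iso-8859-1'))

# The common pattern: Get-Content decodes the bytes before the XML parser
# ever sees the declaration, so the declared encoding is never applied.
$doc = [xml](Get-Content -LiteralPath $path)

# Compare what was read with what was written.
$doc.root -eq $expected
```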

Sadly, this usage pattern is not going anywhere. Far too many examples all over the Internet show the wrong usage, and far too few people are aware of, or care about, how XML implements encodings.

The ideal solution would be to teach people how to do it correctly; the pragmatic solution is to make the pattern above do the right thing.
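
For comparison, here is a sketch of one encoding-aware way to load a file: hand the path to System.Xml.XmlDocument and let the parser open the file itself. The sample file is made up for this example.

```powershell
# Hypothetical sample file in ISO-8859-1, so the declared encoding actually matters.
$path    = Join-Path ([System.IO.Path]::GetTempPath()) 'load-demo.xml'
$content = '<?xml version="1.0" encoding="iso-8859-1"?><root>caf' + [char]0xE9 + '</root>'
[System.IO.File]::WriteAllText($path, $content, [System.Text.Encoding]::GetEncoding('iso-8859-1'))

# Let the XML parser read the bytes, so it can honour the BOM or the declaration.
$doc = [System.Xml.XmlDocument]::new()
$doc.Load($path)   # pass a full path: .NET does not follow PowerShell's current location

$doc.root
```

The price is two extra lines and the need for a full path, which is probably why the shorter Get-Content pattern is the one that spread.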

Proposal

Make Get-Content recognize XML files and switch file encodings on-the-fly. This is what XML parsers do as well.

It’s easy, since the first line of the file will contain the XML declaration <?xml version="..." encoding="..."?> which has the necessary encoding information.

In the absence of an XML declaration, assuming UTF-8, i.e. exactly what Get-Content normally does, is correct.

Proposed technical implementation details

An implementation of encoding sniffing for Get-Content should go like this:

- if there’s a BOM, use that (that’s already happening)
- otherwise, if there is an -Encoding parameter, use that
- otherwise, look at the first couple of bytes of the file as ASCII
  - if it starts with an XML declaration <?xml, parse out the encoding="..." value
    - if it’s a recognized encoding name, switch the file stream to that encoding for the rest of the file
    - otherwise, default to UTF-8
  - otherwise, assume plain text in UTF-8 (that’s already happening)
  - not sure if viable, but nice to have: in the “UTF-8 assumed” case, if there are decode errors, rewind the stream and fall back to the system’s default ANSI encoding (might already happen, I haven’t checked)
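
The steps above can be sketched as a script-level helper. This is only an illustration of the proposed order of precedence, not how Get-Content is or would be implemented: Get-SniffedEncoding, its parameters and the 256-byte read size are all assumptions of this sketch, only the UTF-8 and UTF-16 BOMs are checked, and the optional ANSI fallback is left out.

```powershell
# Hypothetical helper (not part of PowerShell): chooses an encoding in the order the list describes.
function Get-SniffedEncoding {
    param(
        [Parameter(Mandatory)] [string] $LiteralPath,   # full path to the file
        [System.Text.Encoding] $Encoding                 # stands in for Get-Content's -Encoding parameter
    )

    # Read only the head of the file; 256 bytes is an arbitrary size for this sketch.
    $stream = [System.IO.File]::OpenRead($LiteralPath)
    try {
        $buffer = [byte[]]::new(256)
        $count  = $stream.Read($buffer, 0, $buffer.Length)
    }
    finally {
        $stream.Dispose()
    }

    # 1. A BOM wins. (FF FE can also start a UTF-32 BOM; that case is ignored here.)
    if ($count -ge 3 -and $buffer[0] -eq 0xEF -and $buffer[1] -eq 0xBB -and $buffer[2] -eq 0xBF) {
        return [System.Text.Encoding]::UTF8
    }
    if ($count -ge 2 -and $buffer[0] -eq 0xFF -and $buffer[1] -eq 0xFE) {
        return [System.Text.Encoding]::Unicode
    }
    if ($count -ge 2 -and $buffer[0] -eq 0xFE -and $buffer[1] -eq 0xFF) {
        return [System.Text.Encoding]::BigEndianUnicode
    }

    # 2. Otherwise an explicitly requested encoding wins.
    if ($null -ne $Encoding) { return $Encoding }

    # 3. Otherwise read the head as ASCII and look for an XML declaration.
    $head = [System.Text.Encoding]::ASCII.GetString($buffer, 0, $count)
    if ($head -match '^<\?xml\s[^>]*?encoding\s*=\s*["'']([A-Za-z][A-Za-z0-9._-]*)["'']') {
        try {
            return [System.Text.Encoding]::GetEncoding($Matches[1])
        }
        catch {
            # Unrecognized encoding name: fall through to the UTF-8 default.
        }
    }

    # 4. No declaration, or an unknown name: assume UTF-8.
    return [System.Text.Encoding]::UTF8
}

# Usage with a made-up ISO-8859-1 sample file.
$path    = Join-Path ([System.IO.Path]::GetTempPath()) 'sniff-demo.xml'
$content = '<?xml version="1.0" encoding="iso-8859-1"?><root>caf' + [char]0xE9 + '</root>'
[System.IO.File]::WriteAllText($path, $content, [System.Text.Encoding]::GetEncoding('iso-8859-1'))

$sniffed = Get-SniffedEncoding -LiteralPath $path
$text    = [System.IO.File]::ReadAllText($path, $sniffed)
$sniffed.WebName
```

Reading the head as ASCII works because the XML declaration itself is limited to ASCII characters in every ASCII-compatible encoding; BOM-less UTF-16 would need separate handling.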

The actual implementation details can be derived from the way System.Xml.XmlDocument handles this in its .Load() method; I’m sure there are some corner cases.
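
One way to observe what the .NET parser decides for a given file is to ask a reader for its encoding after the declaration has been read. This sketch uses System.Xml.XmlTextReader and its Encoding property rather than XmlDocument itself; the sample file is made up for this example.

```powershell
# Hypothetical sample file in ISO-8859-1 with a matching declaration.
$path    = Join-Path ([System.IO.Path]::GetTempPath()) 'reader-demo.xml'
$content = '<?xml version="1.0" encoding="iso-8859-1"?><root>caf' + [char]0xE9 + '</root>'
[System.IO.File]::WriteAllText($path, $content, [System.Text.Encoding]::GetEncoding('iso-8859-1'))

$reader = [System.Xml.XmlTextReader]::new($path)
try {
    # The first Read() consumes the XML declaration; by then the reader has
    # settled on the encoding it will use for the rest of the document.
    [void]$reader.Read()
    $detected = $reader.Encoding.WebName
}
finally {
    $reader.Close()
}

$detected
```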

Benefits

- It would make things transparently correct for existing scripts without breaking any of them.
- It would make things transparently correct for anyone who copies code off the Internet or who is not deep enough into the details of how XML implements file encodings.
- All the thousands of bad examples on the Internet would be fixed in one sweep.
- It would help people who naively (or for performance reasons) process XML data line-wise as plain text.


Top GitHub Comments

iSazonov commented, Jan 24, 2022

From the WG conclusion, it is not clear what the new suggestion to resolve the issue is.

Tomalak commented, Dec 30, 2020

@mklement0 Yeah, I don’t feel strongly enough about this to make yet another proposal out of it. It’s a messy topic and there is no way my assumptions aren’t biased. The more complex this gets, the higher the chance to introduce unpredictable behavior or new bugs.
