Unicode: surrogate pairs encoded invalidly after x number of chars


The problem here is that the Kappa character (𝜅) starts being encoded as XML character entities, and unfortunately the result is not a valid character encoding. I don't understand why this happens, or why it only starts after X characters rather than at any point.
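For readers unfamiliar with the terms, here is a minimal sketch of how this character is represented. It assumes the kappa in the test is U+1D705 (mathematical italic small kappa), which is what it appears to be; the class and method names are made up for illustration and are not part of Aalto.

```java
import java.nio.charset.StandardCharsets;

public class KappaEncoding {
    // Assumption: the test's kappa is U+1D705, a code point outside the BMP.
    static final String KAPPA = new String(Character.toChars(0x1D705));

    /** Space-separated hex dump of the UTF-8 bytes of a string. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** A well-formed character reference names the whole code point, not one surrogate. */
    static String validCharRef(String s) {
        return "&#x" + Integer.toHexString(s.codePointAt(0)) + ";";
    }

    public static void main(String[] args) {
        // One code point, but two UTF-16 code units (a high and a low surrogate).
        System.out.println("chars: " + KAPPA.length()
                + ", code points: " + KAPPA.codePointCount(0, KAPPA.length()));
        System.out.println("high surrogate: " + Integer.toHexString(KAPPA.charAt(0)));
        System.out.println("low surrogate:  " + Integer.toHexString(KAPPA.charAt(1)));
        // In UTF-8 output the pair must become a single 4-byte sequence.
        System.out.println("UTF-8 bytes: " + utf8Hex(KAPPA));
        System.out.println("valid reference: " + validCharRef(KAPPA));
    }
}
```

A writer producing UTF-8 can either emit the 4-byte sequence directly or a single reference to the full code point. A reference to one surrogate half on its own is not well-formed XML, because XML excludes the surrogate range from the characters a reference may name.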

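import static org.junit.Assert.assertEquals;

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

import org.junit.Test;

import com.fasterxml.aalto.stax.OutputFactoryImpl;

@Test
public void tooManyKappas()
	throws XMLStreamException
{
	XMLOutputFactory factory = OutputFactoryImpl.newInstance();
	if (factory instanceof OutputFactoryImpl) {
		((OutputFactoryImpl) factory).configureForSpeed();
	}
	// Loop to find exactly at which point entity encoding kicks in.
	for (int j = 0; j < 1000; j++) {
		final ByteArrayOutputStream baos = new ByteArrayOutputStream();
		XMLStreamWriter writer = factory.createXMLStreamWriter(baos, StandardCharsets.UTF_8.name());

		final String namespace = "http://example.org";

		// Build a string of (2000 + j) kappas; each one is a surrogate pair in UTF-16.
		StringBuilder kappas = new StringBuilder();

		for (int i = 0; i < (2000 + j); i++) {
			kappas.append("𝜅");
		}
		writer.writeStartElement("", "ex", namespace);
		writer.writeCharacters(kappas.toString());
		writer.writeEndElement();
		writer.close();

		// The output should contain the kappas verbatim, not as character entities.
		assertEquals("fails at " + (2000 + j),
				"<ex>" + kappas + "</ex>",
				new String(baos.toByteArray(), StandardCharsets.UTF_8));
	}
}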

I hope this minimized test case is of help. It's definitely due to something internal to Aalto. WSTX does not have this issue (or only at a much higher loop number…).

The real problem is that the aalto-xml reader cannot deal with the library's own output in this case. The reader is correct to reject it, since it's the writer that is wrong.
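One way to pin down where the output goes wrong without relying on a reader is to scan it for character references that name a surrogate code point. This is only a sketch: the issue does not show the exact bytes Aalto produced, so it assumes the bad output contains references to individual surrogates (decimal or hexadecimal). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurrogateRefCheck {
    // Matches hexadecimal (&#xD835;) and decimal (&#55349;) character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    /**
     * Returns the index of the first character reference that names a
     * surrogate code point (U+D800..U+DFFF), or -1 if there is none.
     */
    static int firstSurrogateRef(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        while (m.find()) {
            long codePoint;
            try {
                codePoint = m.group(1) != null
                        ? Long.parseLong(m.group(1), 16)
                        : Long.parseLong(m.group(2));
            } catch (NumberFormatException e) {
                continue; // absurdly long number; not a surrogate reference
            }
            if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
                return m.start();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A reference to the full code point is fine.
        System.out.println(firstSurrogateRef("<ex>&#x1d705;</ex>"));
        // References to each surrogate half are not well-formed XML.
        System.out.println(firstSurrogateRef("<ex>ab&#xD835;&#xDF05;</ex>"));
    }
}
```

Applied to the decoded output in the test above, the returned index would show how far into the text the first bad reference appears, which may help relate the failure point to the writer's internal buffering.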

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

cowtowncoder commented, Apr 2, 2018

@JervenBolleman happy to fix it. It has been a while, as I am not that active with XML tools these days, but it feels good to clean up the backlog. Aalto is not nearly as well tested as Woodstox is, but it would be great to get it to the same level of quality.

JervenBolleman commented, Apr 2, 2018

Thank you so much for picking this up. It will be great to use aalto-xml again, it's a real speed boost! I really appreciate that you took the time to look into this.


