question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding

See original GitHub issue

When outputting a string value containing a supplementary Unicode code point, UTF8JsonGenerator is encoding the supplementary character as a pair of \uNNNN escapes representing the two halves of the surrogate pair that would denote the code point in UTF-16 instead of using the correct multi-byte UTF-8 encoding of the character. The following Groovy script demonstrates the behaviour:

@Grab(group='com.fasterxml.jackson.core', module='jackson-core', version='2.6.2')
import com.fasterxml.jackson.core.JsonFactory

def factory = new JsonFactory()
def bytes1 = new ByteArrayOutputStream()
def gen1 = factory.createGenerator(bytes1) // UTF8JsonGenerator
gen1.writeStartObject()
gen1.writeStringField("test", new String(Character.toChars(0x1F602)))
gen1.writeEndObject()
gen1.close()
System.out.write(bytes1.toByteArray())
println ""
// prints {"test":"\uD83D\uDE02"}


def bytes2 = new ByteArrayOutputStream()
new OutputStreamWriter(bytes2, "UTF-8").withWriter { w ->
  def gen2 = factory.createGenerator(w) // WriterBasedJsonGenerator
  gen2.writeStartObject()
  gen2.writeStringField("test", new String(Character.toChars(0x1F602)))
  gen2.writeEndObject()
  gen2.close()
}
System.out.write(bytes2.toByteArray())
println ""
// prints {"test":"😂"}

When generating to a Writer rather than an OutputStream (and letting Java handle the UTF-8 byte conversion) the supplementary character U+1F602 is encoded as the correct UTF-8 four byte sequence f0 9f 98 82.

Issue Analytics

  • State:open
  • Created 8 years ago
  • Reactions:1
  • Comments:13 (7 by maintainers)

github_iconTop GitHub Comments

2reactions
cowtowncodercommented, Nov 12, 2019

@rkinabhi It is something that would be nice to resolve but I am not actively working on it at this point (I try to add active label on things I do work on).

1reaction
ianrobertscommented, Oct 12, 2015

From that same section of the spec:

All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F.

Read more comments on GitHub >

github_iconTop Results From Across the Web

Surrogates and Supplementary Characters - Win32 apps
A supplementary character is a character located beyond the BMP, and a "surrogate" is a UTF-16 code value. For UTF-16, a "surrogate pair"...
Read more >
What is a "surrogate pair" in Java? - Stack Overflow
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme. In the Unicode...
Read more >
FAQ - UTF-8, UTF-16, UTF-32 & BOM - Unicode
The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence.
Read more >
Unicode Supplementary Characters Test Data
Unicode Scalar Value UTF‑8 NCR U+2070E 𠜎 𠜎 U+20731 𠜱 𠜱 U+20779 𠝹 𠝹
Read more >
Unicode 15 released - LWN.net
Plenty of stuff written on paper could've used that too! ... UTF-8 and 4 byte encodings give a 21 bit space, (but for...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found