UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding
When outputting a string value containing a supplementary Unicode code point, UTF8JsonGenerator encodes the supplementary character as a pair of \uNNNN escapes representing the two halves of the surrogate pair that would denote the code point in UTF-16, instead of using the correct multi-byte UTF-8 encoding of the character. The following Groovy script demonstrates the behaviour:
@Grab(group='com.fasterxml.jackson.core', module='jackson-core', version='2.6.2')
import com.fasterxml.jackson.core.JsonFactory
def factory = new JsonFactory()
def bytes1 = new ByteArrayOutputStream()
def gen1 = factory.createGenerator(bytes1) // UTF8JsonGenerator
gen1.writeStartObject()
gen1.writeStringField("test", new String(Character.toChars(0x1F602)))
gen1.writeEndObject()
gen1.close()
System.out.write(bytes1.toByteArray())
println ""
// prints {"test":"\uD83D\uDE02"}
def bytes2 = new ByteArrayOutputStream()
new OutputStreamWriter(bytes2, "UTF-8").withWriter { w ->
def gen2 = factory.createGenerator(w) // WriterBasedJsonGenerator
gen2.writeStartObject()
gen2.writeStringField("test", new String(Character.toChars(0x1F602)))
gen2.writeEndObject()
gen2.close()
}
System.out.write(bytes2.toByteArray())
println ""
// prints {"test":"😂"}
When generating to a Writer rather than an OutputStream (and letting Java handle the UTF-8 byte conversion), the supplementary character U+1F602 is encoded as the correct four-byte UTF-8 sequence f0 9f 98 82.
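For reference, the two representations discussed above can be reproduced with the JDK alone. This is only an illustrative sketch: the class and method names (SupplementaryEncoding, utf8Hex, surrogateEscapes) are hypothetical and are not part of Jackson, and the code does not call either generator. It shows the four UTF-8 bytes Java produces for U+1F602 and the two UTF-16 code units that the escaped output corresponds to.

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryEncoding {

    // Hypothetical helper: hex dump of the UTF-8 bytes for one code point,
    // using the JDK's own encoder (the same conversion the Writer path relies on).
    static String utf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Hypothetical helper: one backslash-u escape per UTF-16 code unit,
    // which is the shape of output reported for the byte-based generator.
    static String surrogateEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x1F602;
        // Four-byte UTF-8 sequence: f0 9f 98 82
        System.out.println(utf8Hex(cp));
        // Two escapes, for the high surrogate D83D and the low surrogate DE02
        System.out.println(surrogateEscapes(cp));
    }
}
```

Both forms decode to the same code point in a conforming JSON parser; the difference is that the escaped form takes 12 bytes on the wire where the direct UTF-8 form takes 4.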
Issue Analytics
- State:
- Created 8 years ago
- Reactions:1
- Comments:13 (7 by maintainers)
Top GitHub Comments
@rkinabhi It is something that would be nice to resolve, but I am not actively working on it at this point (I try to add the "active" label on things I do work on).
From that same section of the spec: