Charset autodetection fail
Hello,
JsonFactory#createParser(InputStream in) has the comment:
Note: no encoding argument is taken since it can always be auto-detected as suggested by JSON RFC.
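For context, the RFC's suggestion (RFC 4627, section 3) is that, since legal JSON text begins with two ASCII characters and may only be encoded in UTF-8, UTF-16, or UTF-32, the encoding can be inferred from the pattern of null bytes in the first four octets. Below is a minimal sketch of that heuristic (this is not Jackson's actual detection code, just an illustration); it also shows why ISO-8859-1 slips through: Latin-1 bytes contain no nulls, so at this level they are indistinguishable from UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class JsonEncodingSniffer {
    // Sketch of the RFC 4627 §3 heuristic (NOT Jackson's actual code):
    // the null-byte pattern of the first four octets identifies UTF-8/16/32.
    public static String guessJsonEncoding(byte[] b) {
        if (b.length >= 4) {
            if (b[0] == 0 && b[1] == 0 && b[2] == 0 && b[3] != 0) return "UTF-32BE";
            if (b[0] != 0 && b[1] == 0 && b[2] == 0 && b[3] == 0) return "UTF-32LE";
        }
        if (b.length >= 2) {
            if (b[0] == 0 && b[1] != 0) return "UTF-16BE";
            if (b[0] != 0 && b[1] == 0) return "UTF-16LE";
        }
        // Anything with no null bytes -- including ISO-8859-1 content -- falls
        // through to UTF-8, so "detection" succeeds but picks the wrong charset.
        return "UTF-8";
    }

    public static void main(String[] args) {
        byte[] latin1 = "{\"k\":\"\u00e3\"}".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guessJsonEncoding(latin1)); // prints UTF-8
    }
}
```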
It looks like this doesn't work, and we have no way to set the encoding explicitly.
Below is a demo for a stream with ISO-8859-1 content, while our application encoding is UTF-8 (the common case for Java).
Description: hexDump below is the hex dump of a valid JSON string payload encoded in ISO-8859-1. It contains the character "ã", which has different byte representations in UTF-8 and ISO-8859-1.
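To make the difference concrete, here is a standalone JDK sketch (not part of the original demo): "ã" (U+00E3) is the single byte 0xE3 in ISO-8859-1, but the two bytes 0xC3 0xA3 in UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class AtildeBytes {
    public static void main(String[] args) {
        String atilde = "\u00e3"; // the character 'ã'
        byte[] latin1 = atilde.getBytes(StandardCharsets.ISO_8859_1); // one byte: E3
        byte[] utf8   = atilde.getBytes(StandardCharsets.UTF_8);      // two bytes: C3 A3
        System.out.printf("ISO-8859-1: %02X%n", latin1[0]);
        System.out.printf("UTF-8: %02X %02X%n", utf8[0], utf8[1]);
    }
}
```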
String hexDump = <hex dump >;
byte[] bytes = BaseEncoding.base16().decode(hexDump.toUpperCase()); // load the hex dump into a byte array
System.out.println("Try to read it with UTF-8: " + new String(bytes)); // we see '?' -- something couldn't be decoded
System.out.println("Better (ISO-8859-1): " + new String(bytes, Charset.forName("ISO-8859-1"))); // with the right encoding we get the valid JSON dump in the console
// try to create a JSON parser
JsonFactory jf = new JsonFactory();
JsonParser parser = jf.createParser(new ByteArrayInputStream(bytes));
System.out.println(parser.getClass().getSimpleName()); // prints UTF8StreamJsonParser, which already seems incorrect
while (parser.nextToken() != null) {}
As a result we get:
com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x5f
at [Source: java.io.ByteArrayInputStream@373bc99b; line: 1, column: 764]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3470)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3477)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipUtf8_3(UTF8StreamJsonParser.java:3353)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipString(UTF8StreamJsonParser.java:2547)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:690)
What's going on? The stream contains the sequence "ã_", which is the byte pair E3 5F in ISO-8859-1. When read as UTF-8, 0xE3 is 1110 0011, which marks the start of a three-byte sequence rather than a single one-byte character -- already incorrect. Worse, the next byte, 0x5F (0101 1111), is not a valid UTF-8 continuation byte (those must match the pattern 10xxxxxx), so the parser throws an exception.
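The failing byte pair can be checked with plain bit arithmetic (a standalone sketch, independent of Jackson):

```java
public class Utf8ByteCheck {
    public static void main(String[] args) {
        int lead = 0xE3; // 1110 0011 -> lead byte of a 3-byte UTF-8 sequence
        int next = 0x5F; // '_' = 0101 1111, the byte that follows in the stream

        // A 3-byte lead matches 1110xxxx; a continuation byte must match 10xxxxxx.
        boolean startsThreeByteSeq = (lead & 0xF0) == 0xE0;
        boolean validContinuation  = (next & 0xC0) == 0x80;

        System.out.println(startsThreeByteSeq); // true
        System.out.println(validContinuation);  // false -> "Invalid UTF-8 middle byte 0x5f"
    }
}
```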
Issue Analytics
- Created 8 years ago
- Comments: 14 (9 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@Spikhalskiy You did not seem to understand what I am saying wrt tests. Actual usage, as it must be done with Jackson, is via a Reader, and Reader access is tested. To test it specifically with Latin-1 would only test the JDK InputStreamReader's decoding capabilities. I trust that to work just fine. But I also think people really, really, REALLY should not use ISO-8859-1 for anything. I would go as far as saying that only an idiot would consciously choose it as an encoding at this time and day. I know there are idiots designing and building systems all over the place, but it is wrong to encourage bad practices. If I was younger and more idealistic I might even try to explicitly make usage more difficult.
If support for an officially unsupported encoding (wrt JSON) is desired, there is always XML, which supports a wide variety of encodings, including EBCDIC.
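The Reader-based usage the comment above describes can be sketched as follows. This is a JDK-only sketch of the decoding half; the resulting Reader would then be handed to Jackson via JsonFactory#createParser(Reader), which yields a reader-based parser that never has to guess the charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Latin1ReaderDemo {
    // Decode the Latin-1 bytes ourselves; with Jackson one would then call:
    //   JsonParser parser = jf.createParser(reader);
    // so the charset is handled entirely by the Reader, not by auto-detection.
    static String decodeLatin1(byte[] bytes) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The two bytes that broke the UTF-8 parser decode cleanly as "ã_"
        byte[] bytes = {(byte) 0xE3, 0x5F};
        System.out.println(decodeLatin1(bytes));
    }
}
```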
@Spikhalskiy ah ok. Sorry for misreading and taking the thread in the wrong direction. Updates to the javadocs make perfect sense.
@ypriverol javadoc makes sense; README too, if there's an appropriate place for it (if not, the Wiki).