
Charset autodetection fail

See original GitHub issue

Hello,

JsonFactory#createParser(InputStream in) has the comment:

Note: no encoding argument is taken since it can always be auto-detected as suggested by JSON RFC.

It looks like the auto-detection does not work, and there is no way to set the encoding explicitly.
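For context, the auto-detection the javadoc points to comes from RFC 4627, which only distinguishes the UTF-8/16/32 family by looking at where zero bytes fall in the first four octets (a JSON text begins with two ASCII characters). ISO-8859-1 is byte-for-byte identical to ASCII in those positions, so it is indistinguishable from UTF-8 at this level. A minimal sketch of that detection table (an illustration, not Jackson's actual implementation):

```java
import java.nio.charset.StandardCharsets;

public class JsonEncodingSniffer {
    /**
     * Guess the Unicode encoding of a JSON byte stream from its first four
     * octets, per the table in RFC 4627 section 3. A JSON text begins with
     * two ASCII characters, so the positions of zero bytes reveal the UTF
     * flavor. ISO-8859-1 looks exactly like UTF-8 here, because both encode
     * ASCII identically in the leading bytes.
     */
    static String detectEncoding(byte[] first4) {
        boolean z0 = first4[0] == 0, z1 = first4[1] == 0,
                z2 = first4[2] == 0, z3 = first4[3] == 0;
        if (z0 && z1 && z2 && !z3)  return "UTF-32BE"; // 00 00 00 xx
        if (z0 && !z1 && z2 && !z3) return "UTF-16BE"; // 00 xx 00 xx
        if (!z0 && z1 && z2 && z3)  return "UTF-32LE"; // xx 00 00 00
        if (!z0 && z1 && !z2 && z3) return "UTF-16LE"; // xx 00 xx 00
        return "UTF-8";                                // anything else
    }

    public static void main(String[] args) {
        byte[] ascii   = "{\"a\":1}".getBytes(StandardCharsets.US_ASCII);
        byte[] utf16le = "{\"a\":1}".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(detectEncoding(ascii));   // UTF-8
        System.out.println(detectEncoding(utf16le)); // UTF-16LE
    }
}
```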

Below is a demo for a stream with ISO-8859-1 content, while the application's default encoding is UTF-8 (common for Java).

Description: hexDump is a hex dump of a valid JSON string payload encoded in ISO-8859-1. It contains the character ã, which is encoded differently in UTF-8 and ISO-8859-1.

String hexDump = <hex dump >; // elided; a valid JSON payload encoded in ISO-8859-1
byte[] bytes = BaseEncoding.base16().decode(hexDump.toUpperCase()); // Guava: load the hex dump into a byte array
System.out.println("Try to print it with the default charset (UTF-8): " + new String(bytes)); // prints '?' where bytes can't be decoded
System.out.println("Better (ISO-8859-1): " + new String(bytes, Charset.forName("ISO-8859-1"))); // with the right encoding we get valid JSON in the console

// trying to create a JSON parser
JsonFactory jf = new JsonFactory();
JsonParser parser = jf.createParser(new ByteArrayInputStream(bytes));
System.out.println(parser.getClass().getSimpleName()); // prints UTF8StreamJsonParser, which already seems incorrect
while (parser.nextToken() != null) { }

As a result we get:

com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x5f
 at [Source: java.io.ByteArrayInputStream@373bc99b; line: 1, column: 764]
    at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3470)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3477)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipUtf8_3(UTF8StreamJsonParser.java:3353)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipString(UTF8StreamJsonParser.java:2547)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:690)

What’s going on? The stream contains the sequence “ã_”, which is the byte pair E3 5F in ISO-8859-1. Read as UTF-8, 0xE3 is 1110 0011: a lead byte announcing a three-byte sequence, not a single one-byte character. But the next byte, 0x5F (“_”), is not a valid continuation byte (those must have the form 10xxxxxx), so the decoder throws the exception above.
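The failure can be reproduced with the JDK alone, without Jackson: a strict CharsetDecoder rejects the byte pair 0xE3 0x5F for exactly the reason described above, while ISO-8859-1, which maps every byte to some character, can never fail. A minimal sketch:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xE3, 0x5F}; // "ã_" in ISO-8859-1

        // ISO-8859-1 maps every byte value to a character, so this always succeeds.
        String latin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(latin1); // ã_

        // A strict UTF-8 decoder rejects the same bytes: 0xE3 announces a
        // three-byte sequence, but 0x5F is not a 10xxxxxx continuation byte.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(bytes));
            System.out.println("decoded as UTF-8");
        } catch (CharacterCodingException e) {
            System.out.println("malformed UTF-8: " + e); // this branch is taken
        }
    }
}
```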

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments:14 (9 by maintainers)

Top GitHub Comments

3 reactions
cowtowncoder commented, Nov 9, 2015

@Spikhalskiy You did not seem to understand what I am saying wrt tests. Actual usage, as it must be done with Jackson, is via a Reader, and Reader access is tested. Testing it specifically with Latin-1 would only test the JDK InputStreamReader’s decoding capabilities, which I trust to work just fine.
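The Reader-based usage the comment refers to can be sketched with plain JDK classes: wrap the InputStream in an InputStreamReader with an explicit charset, then hand the resulting Reader to Jackson via JsonFactory#createParser(Reader), which skips byte-level sniffing entirely. The Jackson call is shown only in a comment so the sketch stays dependency-free:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetReader {
    public static void main(String[] args) throws Exception {
        // A small JSON payload encoded in ISO-8859-1; 0xE3 is 'ã'.
        byte[] bytes = {'{', '"', 'k', '"', ':', '"', (byte) 0xE3, '_', '"', '}'};
        InputStream in = new ByteArrayInputStream(bytes);

        // Decode with an explicit charset instead of relying on auto-detection.
        // This Reader would then be passed to Jackson:
        //   JsonParser parser = new JsonFactory().createParser(reader);
        Reader reader = new InputStreamReader(in, StandardCharsets.ISO_8859_1);

        StringBuilder sb = new StringBuilder();
        for (int c; (c = reader.read()) != -1; ) sb.append((char) c);
        System.out.println(sb); // {"k":"ã_"}
    }
}
```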

But I also think people really, really, REALLY should not use ISO-8859-1 for anything. I would go as far as saying that only an idiot would consciously choose it as an encoding in this day and age. I know there are idiots designing and building systems all over the place, but it is wrong to encourage bad practices. If I were younger and more idealistic I might even try to explicitly make its usage more difficult.

If support for an officially non-supported encoding (wrt JSON) is desired, there is always XML, which supports a wide variety of encodings, including EBCDIC.

0 reactions
cowtowncoder commented, Nov 10, 2015

@Spikhalskiy ah ok. Sorry for misreading and taking thread into wrong directions. Updates to javadocs make perfect sense.

@ypriverol javadoc makes sense; README too if there’s appropriate place for that (if not, Wiki).
