Charset autodetection fail
Hello,
JsonFactory#createParser(InputStream in) has the comment:
Note: no encoding argument is taken since it can always be auto-detected as suggested by JSON RFC.
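For context, the RFC's suggestion (RFC 4627, section 3) is that, since legal JSON text begins with two ASCII characters and may only be encoded in UTF-8, UTF-16, or UTF-32, the encoding can be inferred from the pattern of null bytes in the first four octets. Below is a minimal sketch of that heuristic (this is not Jackson's actual detection code, just an illustration); it also shows why ISO-8859-1 slips through: Latin-1 bytes contain no nulls, so at this level they are indistinguishable from UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class JsonEncodingSniffer {
    // Sketch of the RFC 4627 §3 heuristic (NOT Jackson's actual code):
    // the null-byte pattern of the first four octets identifies UTF-8/16/32.
    public static String guessJsonEncoding(byte[] b) {
        if (b.length >= 4) {
            if (b[0] == 0 && b[1] == 0 && b[2] == 0 && b[3] != 0) return "UTF-32BE";
            if (b[0] != 0 && b[1] == 0 && b[2] == 0 && b[3] == 0) return "UTF-32LE";
        }
        if (b.length >= 2) {
            if (b[0] == 0 && b[1] != 0) return "UTF-16BE";
            if (b[0] != 0 && b[1] == 0) return "UTF-16LE";
        }
        // Anything with no null bytes -- including ISO-8859-1 content -- falls
        // through to UTF-8, so "detection" succeeds but picks the wrong charset.
        return "UTF-8";
    }

    public static void main(String[] args) {
        byte[] latin1 = "{\"k\":\"\u00e3\"}".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guessJsonEncoding(latin1)); // prints UTF-8
    }
}
```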
It looks like this doesn't work, and we have no way to set the encoding explicitly.
Below is a demo for a stream with ISO-8859-1 content, while our application encoding is UTF-8 (the common case for Java).
Description: hexDump below is the hex dump of a valid JSON string payload encoded in ISO-8859-1. It contains the character "ã", which has different byte representations in UTF-8 and ISO-8859-1.
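To make the difference concrete, here is a standalone JDK sketch (not part of the original demo): "ã" (U+00E3) is the single byte 0xE3 in ISO-8859-1, but the two bytes 0xC3 0xA3 in UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class AtildeBytes {
    public static void main(String[] args) {
        String atilde = "\u00e3"; // the character 'ã'
        byte[] latin1 = atilde.getBytes(StandardCharsets.ISO_8859_1); // one byte: E3
        byte[] utf8   = atilde.getBytes(StandardCharsets.UTF_8);      // two bytes: C3 A3
        System.out.printf("ISO-8859-1: %02X%n", latin1[0]);
        System.out.printf("UTF-8: %02X %02X%n", utf8[0], utf8[1]);
    }
}
```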
String hexDump = <hex dump >;
byte[] bytes = BaseEncoding.base16().decode(hexDump.toUpperCase()); // load the hex dump into a byte array
System.out.println("Try to read it with UTF-8: " + new String(bytes)); // we see '?' -- something couldn't be decoded
System.out.println("Better (ISO-8859-1): " + new String(bytes, Charset.forName("ISO-8859-1"))); // with the right encoding we get the valid JSON dump in the console
// try to create a JSON parser
JsonFactory jf = new JsonFactory();
JsonParser parser = jf.createParser(new ByteArrayInputStream(bytes));
System.out.println(parser.getClass().getSimpleName()); // prints UTF8StreamJsonParser, which already seems incorrect
while (parser.nextToken() != null) {}
As a result we get:
com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x5f
at [Source: java.io.ByteArrayInputStream@373bc99b; line: 1, column: 764]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3470)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3477)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipUtf8_3(UTF8StreamJsonParser.java:3353)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipString(UTF8StreamJsonParser.java:2547)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:690)
What's going on? The stream contains the sequence "ã_", which is the byte pair E3 5F in ISO-8859-1. When read as UTF-8, 0xE3 is 1110 0011, which marks the start of a three-byte sequence rather than a single one-byte character -- already incorrect. Worse, the next byte, 0x5F (0101 1111), is not a valid UTF-8 continuation byte (those must match the pattern 10xxxxxx), so the parser throws an exception.
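The failing byte pair can be checked with plain bit arithmetic (a standalone sketch, independent of Jackson):

```java
public class Utf8ByteCheck {
    public static void main(String[] args) {
        int lead = 0xE3; // 1110 0011 -> lead byte of a 3-byte UTF-8 sequence
        int next = 0x5F; // '_' = 0101 1111, the byte that follows in the stream

        // A 3-byte lead matches 1110xxxx; a continuation byte must match 10xxxxxx.
        boolean startsThreeByteSeq = (lead & 0xF0) == 0xE0;
        boolean validContinuation  = (next & 0xC0) == 0x80;

        System.out.println(startsThreeByteSeq); // true
        System.out.println(validContinuation);  // false -> "Invalid UTF-8 middle byte 0x5f"
    }
}
```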
Issue Analytics
- Created 8 years ago
- Comments: 14 (9 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@Spikhalskiy You did not seem to understand what I am saying wrt tests. Actual usage, as it must be done with Jackson, is via a Reader, and Reader access is tested. To test it specifically with Latin-1 would only test the JDK InputStreamReader's decoding capabilities. I trust that to work just fine. But I also think people really, really, REALLY should not use ISO-8859-1 for anything. I would go as far as saying that only an idiot would consciously choose it as an encoding at this time and day. I know there are idiots designing and building systems all over the place, but it is wrong to encourage bad practices. If I was younger and more idealistic I might even try to explicitly make usage more difficult.
If support for an officially unsupported encoding (wrt JSON) is desired, there is always XML, which supports a wide variety of encodings, including EBCDIC.
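The Reader-based usage the comment above describes can be sketched as follows. This is a JDK-only sketch of the decoding half; the resulting Reader would then be handed to Jackson via JsonFactory#createParser(Reader), which yields a reader-based parser that never has to guess the charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Latin1ReaderDemo {
    // Decode the Latin-1 bytes ourselves; with Jackson one would then call:
    //   JsonParser parser = jf.createParser(reader);
    // so the charset is handled entirely by the Reader, not by auto-detection.
    static String decodeLatin1(byte[] bytes) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The two bytes that broke the UTF-8 parser decode cleanly as "ã_"
        byte[] bytes = {(byte) 0xE3, 0x5F};
        System.out.println(decodeLatin1(bytes));
    }
}
```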
@Spikhalskiy ah ok. Sorry for misreading and taking the thread in the wrong direction. Updates to the javadocs make perfect sense.
@ypriverol javadoc makes sense; README too, if there's an appropriate place for it (if not, the Wiki).