Behaviour of ByteString.utf8() with bytes that aren't valid UTF-8
Is there a defined behaviour for ByteString.utf8(...)
for sequences of bytes which are not valid UTF-8?
I’ve tried the following:
byte[] invalidUtf8 = {(byte) 0xFF}; // 0xFF never appears in valid UTF-8
ByteString b = ByteString.of(invalidUtf8);
System.out.println(b.utf8()); // prints � (U+FFFD, the replacement character)
I was expecting an exception of some kind (though I could very well be wrong about what counts as “invalid” UTF-8). Is there a defined behaviour in this case?
Issue Analytics
- State:
- Created 5 years ago
- Comments:5 (3 by maintainers)
Not really. The replacement character is a valid codepoint that can legitimately be part of a string. See for example @swankjesse’s comment above.
You can check with Okio by encoding and decoding and comparing. Valid UTF-8 will roundtrip without changes; invalid UTF-8 won’t.
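The round-trip check can be sketched with only the JDK, so it runs without Okio on the classpath. This is a minimal illustration, not Okio's own code: the class and method names (Utf8Check, roundTrips, decodeStrict) are made up for the example. With Okio, the same comparison should be expressible as ByteString.encodeUtf8(b.utf8()).equals(b). The second method shows one way to get the exception the question expected, by asking a JDK CharsetDecoder to report malformed input instead of replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, named for this example only.
final class Utf8Check {

    // Decode leniently (malformed input becomes U+FFFD), re-encode, and compare.
    // Valid UTF-8 comes back byte-for-byte identical; invalid UTF-8 does not.
    static boolean roundTrips(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.UTF_8));
    }

    // Decode strictly: malformed input raises CharacterCodingException
    // instead of being silently replaced.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] invalid = {(byte) 0xFF};
        // A genuine, correctly encoded U+FFFD: valid UTF-8 that already
        // contains the replacement character.
        byte[] genuineReplacementChar = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};

        System.out.println(roundTrips(invalid));
        System.out.println(roundTrips(genuineReplacementChar));

        try {
            decodeStrict(invalid);
        } catch (CharacterCodingException e) {
            System.out.println("strict decode rejected the input: " + e);
        }
    }
}
```

The second input is the reason searching the decoded string for U+FFFD is not a reliable test: a correctly encoded replacement character round-trips like any other valid text, whereas the lone 0xFF byte does not.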