Behaviour of ByteString.utf8() with bytes that aren't valid UTF-8

Is there a defined behaviour for ByteString.utf8(...) for sequences of bytes which are not valid UTF-8?

I’ve tried the following:

import okio.ByteString;

byte[] invalidUtf8 = {(byte) 0xFF}; // 0xFF never appears in valid UTF-8
ByteString b = ByteString.of(invalidUtf8);
System.out.println(b.utf8()); // prints the replacement character U+FFFD (�)

I was expecting an exception of some kind (though I could very well be wrong about what counts as “invalid” UTF-8). Is there a defined behaviour in this case?
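
For comparison, here is a minimal sketch of a decoder that does throw on malformed input. It uses only the JDK's CharsetDecoder, not Okio, and the class and method names (StrictUtf8, decodeStrict) are made up for illustration. It is one possible way to get the exception the question expected, not a statement about how Okio behaves internally.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Hypothetical helper: decodes bytes as UTF-8 and throws on malformed
    // input instead of substituting the replacement character U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] invalidUtf8 = {(byte) 0xFF}; // same input as in the question
        try {
            System.out.println(decodeStrict(invalidUtf8));
        } catch (CharacterCodingException e) {
            // Reached for 0xFF, because it is not a legal UTF-8 byte.
            System.out.println("Not valid UTF-8: " + e);
        }
    }
}
```

The design choice here is to fail fast: callers that need to reject bad input get a checked exception, while callers that prefer lossy decoding can keep using a replacing decoder.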

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
cketti commented, Oct 25, 2018

Is it a safe assumption that if the result of ByteString.utf8() contains the Unicode replacement character then the underlying bytes are not valid UTF-8?

Not really. The replacement character is a valid codepoint that can legitimately be part of a string. See for example @swankjesse’s comment above.

0 reactions
swankjesse commented, Apr 12, 2019

You can check with Okio by encoding and decoding and comparing. Valid UTF-8 will roundtrip without changes; invalid UTF-8 won’t.
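
The round-trip idea can be sketched with the JDK alone. The example below is an analogue of the suggestion above, not Okio code: it decodes with new String(bytes, UTF_8), which replaces malformed input with U+FFFD, re-encodes, and compares. The names Utf8RoundTrip and isValidUtf8 are hypothetical. With Okio, the same comparison would presumably be between the original ByteString and ByteString.encodeUtf8(original.utf8()).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8RoundTrip {
    // Hypothetical helper: returns true when the bytes survive a
    // decode/encode round trip unchanged, i.e. they are valid UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        // Malformed sequences are replaced with U+FFFD during decoding.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(bytes, reencoded);
    }

    public static void main(String[] args) {
        byte[] invalid = {(byte) 0xFF};
        // EF BF BD is the valid UTF-8 encoding of U+FFFD itself.
        byte[] literalReplacementChar = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};

        System.out.println(isValidUtf8(invalid));                // false
        System.out.println(isValidUtf8(literalReplacementChar)); // true
    }
}
```

The second case shows why looking for the replacement character in the decoded string is not a reliable test: both inputs decode to the same one-character string, but only the first changes when re-encoded.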

Top Results From Across the Web

How to convert Strings to and from UTF8 byte arrays in Java
The docs do state: "The behavior of this method when this string cannot be encoded in the given charset is unspecified. The CharsetEncoder...

Haskell with UTF-8 - Serokell
hGetContents: invalid argument (invalid byte sequence) ... Nowadays we have UTF-8, which can encode all of Unicode, and we also have UTF-16 ...

A byte string library for Rust - Andrew Gallant's Blog
Byte strings optimistically assume your strings are UTF-8 and deal with invalid UTF-8 by defining some reasonable behavior on all of its APIs ......

Validating UTF-8 bytes (Java edition) - Daniel Lemire's blog
We send and receive bytes over the network all the time. If you know that the bytes you are receiving form a string,...

Strings, bytes, runes and characters in Go
Because some of the bytes in our sample string are not valid ASCII, not even valid UTF-8, printing the string directly will produce...