Behaviour of ByteString.utf8() with bytes that aren't valid UTF-8

See original GitHub issue

Is there a defined behaviour for ByteString.utf8(...) for sequences of bytes which are not valid UTF-8?

I’ve tried the following:

import okio.ByteString;

byte[] invalidUtf8 = {(byte) 0xFF}; // 0xFF never appears in valid UTF-8
ByteString b = ByteString.of(invalidUtf8);
System.out.println(b.utf8()); // prints � (U+FFFD, the Unicode replacement character)

I was expecting an exception of some kind (though I could very well be wrong about what counts as “invalid” UTF-8). Is there a defined behaviour in this case?
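For comparison, here is a minimal sketch of decoding that does throw on malformed input. It uses only the JDK's CharsetDecoder, not Okio; the class name StrictUtf8 and the method decodeStrict are hypothetical names chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, JDK only: this is not part of Okio.
final class StrictUtf8 {
    /** Decodes bytes as UTF-8 and throws instead of substituting U+FFFD. */
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] invalidUtf8 = {(byte) 0xFF};
        try {
            decodeStrict(invalidUtf8);
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            // Reached for the 0xFF input, because it is malformed UTF-8.
            System.out.println("invalid UTF-8: " + e);
        }
    }
}
```

By contrast, new String(bytes, StandardCharsets.UTF_8) substitutes the replacement character for malformed input, which matches the output observed in the snippet above.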

Issue Analytics

State:
Created 5 years ago
Comments:5 (3 by maintainers)

Top GitHub Comments

1 reaction

cketti commented, Oct 25, 2018

Is it a safe assumption that if the result of ByteString.utf8() contains the unicode replacement character then the underlying bytes are not valid UTF-8?

Not really. The replacement character is a valid codepoint that can legitimately be part of a string. See for example @swankjesse’s comment above.
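A small sketch of why the replacement character alone proves nothing: the valid byte sequence EF BF BD and the invalid byte FF decode to the same string. This uses the JDK's String decoding rather than Okio, and ReplacementCharDemo and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, JDK only: this is not part of Okio.
final class ReplacementCharDemo {
    /** UTF-8 encoding of U+FFFD itself: the valid bytes EF BF BD. */
    static byte[] encodedReplacement() {
        return "\uFFFD".getBytes(StandardCharsets.UTF_8);
    }

    /** Lenient decoding: malformed input is replaced with U+FFFD. */
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String fromValid = decodeLenient(encodedReplacement());
        String fromInvalid = decodeLenient(new byte[] {(byte) 0xFF});
        // Both strings are a single U+FFFD, so they compare equal.
        System.out.println(fromValid.equals(fromInvalid)); // prints true
    }
}
```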

0 reactions

swankjesse commented, Apr 12, 2019

You can check with Okio by encoding and decoding and comparing. Valid UTF-8 will roundtrip without changes; invalid UTF-8 won’t.
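The round-trip idea can be sketched with the JDK alone, as below. This is an analogue of the suggestion, not Okio code: Utf8RoundTrip and roundTrips are hypothetical names. With Okio, the same comparison would presumably be between the original ByteString and ByteString.encodeUtf8(b.utf8()).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, JDK only: this is not part of Okio.
final class Utf8RoundTrip {
    /** Returns true if decoding and re-encoding yields the same bytes. */
    static boolean roundTrips(byte[] bytes) {
        // Malformed input is replaced with U+FFFD during decoding...
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        // ...so re-encoding produces different bytes for invalid input.
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(bytes, reencoded);
    }

    public static void main(String[] args) {
        byte[] invalid = {(byte) 0xFF};
        byte[] validReplacement = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};
        System.out.println(roundTrips(invalid));          // prints false
        System.out.println(roundTrips(validReplacement)); // prints true
    }
}
```

Unlike scanning the decoded string for U+FFFD, this check accepts input that legitimately contains the replacement character.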

Top Results From Across the Web

How to convert Strings to and from UTF8 byte arrays in Java

The docs do state: "The behavior of this method when this string cannot be encoded in the given charset is unspecified. The CharsetEncoder...

Haskell with UTF-8 - Serokell

hGetContents: invalid argument (invalid byte sequence) ... Nowadays we have UTF-8, which can encode all of Unicode, and we also have UTF-16 ...

A byte string library for Rust - Andrew Gallant's Blog

Byte strings optimistically assume your strings are UTF-8 and deal with invalid UTF-8 by defining some reasonable behavior on all of its APIs ......

Validating UTF-8 bytes (Java edition) - Daniel Lemire's blog

We send and receive bytes over the network all the time. If you know that the bytes you are receiving form a string,...

Strings, bytes, runes and characters in Go

Because some of the bytes in our sample string are not valid ASCII, not even valid UTF-8, printing the string directly will produce...

