Handle codepoints
Code points are increasingly common where chars used to suffice. Working with them is hard because the Java API was not really designed around them, and we're left with a handful of methods that aren't even in the same place.
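For context, here is a short sketch of the code point methods the JDK does offer today, and how they are spread across String, Character-style helpers, and StringBuilder. The class name CodePointApiTour and its helper methods are made up for illustration; only the JDK calls themselves are real.

```java
final class CodePointApiTour {
    private CodePointApiTour() {}

    /** Counts code points rather than UTF-16 chars (String.codePointCount). */
    static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Returns the code point at a code point index, not a char index. */
    static int codePointAtIndex(String s, int index) {
        // offsetByCodePoints translates a code point offset into a char offset.
        return s.codePointAt(s.offsetByCodePoints(0, index));
    }

    /** Reverses by code point so that surrogate pairs stay intact. */
    static String reverse(String s) {
        int[] codePoints = s.codePoints().toArray(); // inherited from CharSequence
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = codePoints.length - 1; i >= 0; i--) {
            sb.appendCodePoint(codePoints[i]); // lives on StringBuilder
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(s.length());
        System.out.println(length(s));
        System.out.println(Integer.toHexString(codePointAtIndex(s, 1)));
        System.out.println(reverse(s));
    }
}
```

Even this small tour touches three different classes, which is the scattering the issue complains about.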
A good place to start would be a kind of CodepointStream, which could then be expanded on. I'm not talking about String::codePoints, but rather about a new kind of Reader:
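import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.util.Objects;

abstract class CodepointStream implements Closeable {
    abstract int read() throws IOException;
    abstract int read(int[] buffer) throws IOException;
}

class ReaderCodepointStream extends CodepointStream {
    private final Reader delegate;

    ReaderCodepointStream(Reader reader) {
        delegate = Objects.requireNonNull(reader);
    }

    @Override
    int read() throws IOException {
        int high = delegate.read();
        // End of stream, or a char that does not start a surrogate pair: return it as-is.
        if (high == -1 || !Character.isHighSurrogate((char) high)) {
            return high;
        }
        int low = delegate.read();
        if (low == -1 || !Character.isLowSurrogate((char) low)) {
            throw new IOException("Invalid surrogate pair");
        }
        return Character.toCodePoint((char) high, (char) low);
    }

    @Override
    int read(int[] buffer) throws IOException {
        // Implement as efficiently as possible, merging characters when a high/low pair is encountered.
        throw new UnsupportedOperationException("not implemented yet");
    }

    @Override
    public void close() throws IOException {
        delegate.close(); // was reader.close(), but the field is named delegate
    }
}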
And maybe later extend this with tool objects like CodepointSource/Sink?
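One way the bulk read(int[] buffer) left unimplemented above might look is sketched here. It is a sketch under assumptions, not the proposed implementation: BulkCodePointReader and readAll are hypothetical names, the return convention (number of code points read, or -1 at end of stream) is borrowed from Reader.read(char[]), and it keeps the same policy as the single read() of throwing on an unpaired high surrogate while passing a lone low surrogate through unchanged.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Objects;

final class BulkCodePointReader implements Closeable {
    private final Reader delegate;
    private char[] chars = new char[0]; // scratch buffer, grown on demand

    BulkCodePointReader(Reader delegate) {
        this.delegate = Objects.requireNonNull(delegate);
    }

    /** Fills buffer with code points; returns how many were read, or -1 at end of stream. */
    int read(int[] buffer) throws IOException {
        if (buffer.length == 0) {
            return 0;
        }
        if (chars.length < buffer.length) {
            chars = new char[buffer.length];
        }
        // n chars decode to at most n code points, so the result always fits in buffer.
        int n = delegate.read(chars, 0, buffer.length);
        if (n == -1) {
            return -1;
        }
        int count = 0;
        for (int i = 0; i < n; i++) {
            char c = chars[i];
            if (!Character.isHighSurrogate(c)) {
                buffer[count++] = c;
                continue;
            }
            // The low half is either next in the chunk or must be fetched from the reader.
            int low = (i + 1 < n) ? chars[++i] : delegate.read();
            if (low == -1 || !Character.isLowSurrogate((char) low)) {
                throw new IOException("Invalid surrogate pair");
            }
            buffer[count++] = Character.toCodePoint(c, (char) low);
        }
        return count;
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }

    /** Convenience for this sketch: decodes a whole string in chunks of chunkSize. */
    static int[] readAll(String text, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int[] out = new int[text.length()];
        int total = 0;
        int[] buffer = new int[chunkSize];
        try (BulkCodePointReader in = new BulkCodePointReader(new StringReader(text))) {
            for (int n; (n = in.read(buffer)) != -1; ) {
                System.arraycopy(buffer, 0, out, total, n);
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(out, total);
    }
}
```

The chunk boundary is the interesting case: when a chunk ends on a high surrogate, the sketch pulls one extra char from the reader instead of carrying state over to the next call.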
Issue Analytics
- State:
- Created 3 years ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
As a general rule, we have tried to stay away from internationalization:
That said, we do occasionally try to provide a baseline level of not being gratuitously incompatible with internationalization, like how Strings.commonPrefix won’t break in the middle of a surrogate pair. But that’s hardly true internationalization, and I wonder sometimes if it’s actually worse for us to have “partial” Unicode handling than to have none at all.

CodepointStream probably falls into that “partial” bucket. Probably it’s a little better than commonPrefix, though, since it at least lets users operate on the code points however they want, rather than directly splitting them up in a potentially incorrect way. I don’t think it will be a priority for us, but it doesn’t immediately strike me as so obviously out of scope that I feel obligated to close the issue entirely 😃

(FWIW we do have an internal API that uses BreakIterator to present an Iterable<String> view of an input text, broken by characters, lines, sentences, or words. I suspect that there are many cases in which such convenience wrappers around ICU4J could be helpful. (In fact, I can think of some wrappers for date-time formatting that we have, too.) That much is out of scope for Guava, but ideally such a thing would exist somewhere.)

Cool! You can use ICU4J’s BreakIterator to get the words and grapheme clusters in your strings.

(I don’t really understand why those methods have different versions that either do or don’t accept a locale. By comparison, Rust’s unicode-segmentation crate, which does the same thing as BreakIterator, doesn’t accept the Rust equivalent of a locale.)

However, BreakIterator is a bit hard to use, so you may have a better time if you either store the parsed characters/words in List<String>s after using it, or wrap usages of BreakIterator in a custom lazy Iterable<String>.
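A minimal sketch of the lazy Iterable<String> wrapper suggested above. To stay self-contained it uses the JDK's java.text.BreakIterator rather than ICU4J; ICU4J's com.ibm.icu.text.BreakIterator follows the same first()/next()/DONE style, so swapping the import should be close to a drop-in change, though that is an assumption worth verifying. Segments and characters are hypothetical names, not an existing API.

```java
import java.text.BreakIterator;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class Segments {
    private Segments() {}

    /** Lazily iterates over the character boundaries that BreakIterator reports for text. */
    static Iterable<String> characters(String text) {
        // A fresh BreakIterator per iterator() call, since BreakIterator is stateful.
        return () -> iterate(BreakIterator.getCharacterInstance(), text);
    }

    private static Iterator<String> iterate(BreakIterator breaks, String text) {
        breaks.setText(text);
        return new Iterator<String>() {
            private int start = breaks.first();
            private int end = breaks.next(); // DONE right away for empty text

            @Override
            public boolean hasNext() {
                return end != BreakIterator.DONE;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String segment = text.substring(start, end);
                start = end;
                end = breaks.next(); // advance only when the caller asks for more
                return segment;
            }
        };
    }
}
```

Usage would look like `for (String c : Segments.characters(input)) { ... }`. The same iterate helper could back word or sentence views by passing BreakIterator.getWordInstance() or getSentenceInstance(), bearing in mind that the word instance also reports whitespace and punctuation runs as segments.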