Handle codepoints

See original GitHub issue

Codepoints increasingly matter more than chars. Working with them is hard because the Java API was not really designed around them, and we're left with a handful of methods that are not even in the same place.

A good place to start would be a kind of CodepointStream, which could then be expanded on. I'm not talking about String::codePoints, but rather about a new kind of Reader:

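import static java.util.Objects.requireNonNull;

import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;

abstract class CodepointStream implements Closeable {
  // Returns the next code point, or -1 at end of stream.
  abstract int read() throws IOException;
  // Fills the buffer with code points and returns how many were read.
  abstract int read(int[] buffer) throws IOException;
}
class ReaderCodepointStream extends CodepointStream {
  private final Reader delegate;
  ReaderCodepointStream(Reader reader) { delegate = requireNonNull(reader); }
  @Override int read() throws IOException {
    int high = delegate.read();
    // End of stream, or a char that does not start a surrogate pair: return it as-is.
    if (high == -1 || !Character.isHighSurrogate((char) high)) {
      return high;
    }
    int low = delegate.read();
    if (low == -1 || !Character.isLowSurrogate((char) low)) {
      throw new IOException("Invalid surrogate pair");
    }
    return Character.toCodePoint((char) high, (char) low);
  }
  @Override int read(int[] buffer) throws IOException {
    // Implement as efficiently as possible, merging characters when a high/low pair is encountered.
    throw new UnsupportedOperationException("Not implemented yet");
  }
  // close() must be public and must close the delegate field.
  @Override public void close() throws IOException { delegate.close(); }
}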

And maybe later extend this with tool objects like CodepointSource/Sink?
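
The bulk read(int[] buffer) is left as a comment in the proposal. One possible shape for it is sketched below, as a static helper over a plain Reader. The names CodepointBulkRead and readCodePoints are hypothetical, and the choice to return -1 at end of stream mirrors Reader.read(char[]) rather than anything the issue specifies. It reads one char at a time, so it is a simple version and not the "as efficient as possible" one the issue asks for.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of Guava or the JDK.
final class CodepointBulkRead {
  /**
   * Reads up to buffer.length code points from the reader.
   * Returns the number of code points stored, or -1 if the stream was
   * already at its end (assumption: same convention as Reader.read(char[])).
   */
  static int readCodePoints(Reader reader, int[] buffer) throws IOException {
    int count = 0;
    while (count < buffer.length) {
      int high = reader.read();
      if (high == -1) {
        break; // end of stream
      }
      if (!Character.isHighSurrogate((char) high)) {
        buffer[count++] = high; // BMP char (or lone low surrogate, passed through as in the issue)
        continue;
      }
      int low = reader.read();
      if (low == -1 || !Character.isLowSurrogate((char) low)) {
        throw new IOException("Invalid surrogate pair");
      }
      buffer[count++] = Character.toCodePoint((char) high, (char) low);
    }
    return (count == 0 && buffer.length > 0) ? -1 : count;
  }

  public static void main(String[] args) throws IOException {
    int[] buffer = new int[8];
    // "a", U+1F600 (a surrogate pair in UTF-16), "b"
    int n = readCodePoints(new StringReader("a\uD83D\uDE00b"), buffer);
    for (int i = 0; i < n; i++) {
      System.out.printf("U+%04X%n", buffer[i]); // U+0061, U+1F600, U+0062
    }
  }
}
```

Because each call goes through Reader.read(), wrapping the source in a BufferedReader is probably the cheapest way to make this acceptable in practice.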

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
cpovirk commented, Feb 22, 2021

As a general rule, we have tried to stay away from internationalization:

  • That's not because it's unimportant but rather because it's too important to be handled by non-experts such as us.
  • Any real internationalization requires an ICU4J dependency. And, yes, I know that we’re already not dependency-free (and I know this has caused people pain), but we are trying to be deliberate about growing our scope in ways that would require runtime dependencies.

That said, we do occasionally try to provide a baseline level of not being gratuitously incompatible with internationalization, like how Strings.commonPrefix won’t break in the middle of a surrogate pair. But that’s hardly true internationalization, and I wonder sometimes if it’s actually worse for us to have “partial” Unicode handling than to have none at all.
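
To make the surrogate-pair point concrete, here is a minimal sketch of what "breaking in the middle of a surrogate pair" looks like and one way to avoid it. This is not Guava's source: the class SurrogateSafePrefix and its methods are hypothetical names used only for illustration.

```java
// Hypothetical illustration, not Guava's implementation of Strings.commonPrefix.
final class SurrogateSafePrefix {
  /** Char-by-char common prefix; may end on a dangling high surrogate. */
  static String naiveCommonPrefix(String a, String b) {
    int max = Math.min(a.length(), b.length());
    int i = 0;
    while (i < max && a.charAt(i) == b.charAt(i)) {
      i++;
    }
    return a.substring(0, i);
  }

  /** True if index i is the high half of a valid surrogate pair in s. */
  static boolean startsPairAt(String s, int i) {
    return i >= 0
        && i + 1 < s.length()
        && Character.isHighSurrogate(s.charAt(i))
        && Character.isLowSurrogate(s.charAt(i + 1));
  }

  /** Common prefix that backs off by one char if it would split a pair. */
  static String safeCommonPrefix(String a, String b) {
    int end = naiveCommonPrefix(a, b).length();
    if (startsPairAt(a, end - 1) || startsPairAt(b, end - 1)) {
      end--;
    }
    return a.substring(0, end);
  }

  public static void main(String[] args) {
    String a = "a\uD83D\uDE00"; // "a" + U+1F600
    String b = "a\uD83D\uDE01"; // "a" + U+1F601, same high surrogate
    System.out.println(naiveCommonPrefix(a, b).length()); // 2: "a" plus a lone high surrogate
    System.out.println(safeCommonPrefix(a, b).length());  // 1: just "a"
  }
}
```

The two inputs share the high surrogate D83D, so a char-level comparison produces a prefix that is not valid UTF-16 text on its own.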

CodepointStream probably falls into that “partial” bucket. Probably it’s a little better than commonPrefix, though, since it at least lets users operate on the code points however they want, rather than directly splitting them up in a potentially incorrect way. I don’t think it will be a priority for us, but it doesn’t immediately strike me as so obviously out of scope that I feel obligated to close the issue entirely 😃

(FWIW we do have an internal API that uses BreakIterator to present an Iterable<String> view of an input text, broken by characters, lines, sentences, or words. I suspect that there are many cases in which such convenience wrappers around ICU4J could be helpful. (In fact, I can think of some wrappers for date-time formatting that we have, too.) That much is out of scope for Guava, but ideally such a thing would exist somewhere.)

1 reaction
jbduncan commented, Feb 20, 2021

Cool! You can use ICU4J’s BreakIterator to get the words and grapheme clusters in your strings.

(I don’t really understand why those methods have different versions that either do or don’t accept a locale. By comparison, Rust’s unicode-segmentation crate, which does the same thing as BreakIterator, doesn’t accept the Rust equivalent of a locale.)

However, BreakIterator is a bit hard to use, so you may have a better time if you either store the parsed characters/words in List<String>s after using it, or wrap usages of BreakIterator in a custom lazy Iterable<String>.
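
A lazy Iterable<String> wrapper of the kind described above might look roughly like the following. To keep the sketch free of dependencies it uses the JDK's java.text.BreakIterator rather than ICU4J's class of the same name, which is a swap from what the comment recommends; the class name Segments and the method characters are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical wrapper. Uses java.text.BreakIterator (JDK) as a stand-in for ICU4J.
final class Segments {
  /** Lazily yields the user-perceived characters of text, one segment per step. */
  static Iterable<String> characters(String text) {
    return () -> {
      // A fresh BreakIterator per iteration, since BreakIterator is stateful.
      BreakIterator breaks = BreakIterator.getCharacterInstance();
      breaks.setText(text);
      return new Iterator<String>() {
        private int start = breaks.first();
        private int end = breaks.next();

        @Override public boolean hasNext() {
          return end != BreakIterator.DONE;
        }

        @Override public String next() {
          if (!hasNext()) {
            throw new NoSuchElementException();
          }
          String segment = text.substring(start, end);
          start = end;
          end = breaks.next(); // advance only when the caller asks for more
          return segment;
        }
      };
    };
  }

  public static void main(String[] args) {
    for (String segment : characters("abc")) {
      System.out.println(segment);
    }
  }
}
```

Swapping getCharacterInstance() for getWordInstance() or getSentenceInstance() would give the other views mentioned in the thread, though how boundaries are chosen depends on the BreakIterator implementation and locale.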


