Handle codepoints

Code points, rather than chars, are increasingly what code has to deal with. Working with them is hard because the Java API wasn’t really designed around them, and we’re left with a handful of methods that aren’t even in the same place.
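To illustrate the scattering, here is a minimal sketch using only JDK methods. The class name CodePointBasics and its helper names are made up for this example; the JDK calls themselves live on String (codePointCount, offsetByCodePoints, codePointAt) and on Character (charCount, isSupplementaryCodePoint).

```java
final class CodePointBasics {
  /** Counts code points; String.length() would count UTF-16 chars instead. */
  static int codePointLength(String s) {
    return s.codePointCount(0, s.length());
  }

  /** Keeps at most maxCodePoints code points without cutting a surrogate pair in half. */
  static String truncate(String s, int maxCodePoints) {
    if (codePointLength(s) <= maxCodePoints) {
      return s;
    }
    // offsetByCodePoints (on String) converts a code point count into a char index.
    return s.substring(0, s.offsetByCodePoints(0, maxCodePoints));
  }

  /** Manual iteration: codePointAt is on String, charCount is on Character. */
  static int countSupplementary(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); ) {
      int codePoint = s.codePointAt(i);
      if (Character.isSupplementaryCodePoint(codePoint)) {
        count++;
      }
      i += Character.charCount(codePoint); // advance by 1 or 2 chars
    }
    return count;
  }
}
```

For a string such as "a\uD83D\uDE00b" (one supplementary character between two ASCII letters), length() reports 4 chars while codePointLength reports 3 code points.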

A nice starting point would be some kind of CodepointStream, which could then be expanded on. I’m not talking about String::codePoints, but rather about a new kind of Reader:

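import static java.util.Objects.requireNonNull;
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;

abstract class CodepointStream implements Closeable {
  // Returns the next code point, or -1 at end of stream.
  abstract int read() throws IOException;
  // Assumed contract, mirroring Reader.read(char[]): returns the number of
  // code points stored in buffer, or -1 at end of stream.
  abstract int read(int[] buffer) throws IOException;
}
class ReaderCodepointStream extends CodepointStream {
  private final Reader delegate;
  ReaderCodepointStream(Reader reader) { delegate = requireNonNull(reader); }
  @Override int read() throws IOException {
    int high = delegate.read();
    // End of stream, or a char that does not start a surrogate pair: return it as-is.
    if (high == -1 || !Character.isHighSurrogate((char) high)) {
      return high;
    }
    int low = delegate.read();
    if (low == -1 || !Character.isLowSurrogate((char) low)) {
      throw new IOException("Invalid surrogate pair");
    }
    return Character.toCodePoint((char) high, (char) low);
  }
  @Override int read(int[] buffer) throws IOException {
    // Implement as efficiently as possible, merging characters when a high/low pair is encountered.
    // Placeholder so the class compiles: fill the buffer one code point at a time.
    int count = 0;
    for (int cp; count < buffer.length && (cp = read()) != -1; ) {
      buffer[count++] = cp;
    }
    return (count == 0 && buffer.length > 0) ? -1 : count;
  }
  @Override public void close() throws IOException { delegate.close(); }
}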
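One possible way to do the bulk read "efficiently, merging characters when a high/low pair is encountered" is sketched below. It is only an illustration under assumptions: the class BulkCodePoints and the method readCodePoints are hypothetical names, and the return contract (count of code points, or -1 at end of stream) is assumed by analogy with Reader.read(char[]). It reads chars in one bulk call and pulls one extra char only when the chunk ends on a high surrogate.

```java
import java.io.IOException;
import java.io.Reader;

final class BulkCodePoints {
  /**
   * Reads up to out.length code points from in.
   * Returns the number of code points stored, or -1 at end of stream (assumed contract).
   */
  static int readCodePoints(Reader in, int[] out) throws IOException {
    if (out.length == 0) {
      return 0;
    }
    // One spare slot so a trailing high surrogate can be completed.
    char[] chars = new char[out.length + 1];
    int n = in.read(chars, 0, out.length);
    if (n == -1) {
      return -1;
    }
    if (Character.isHighSurrogate(chars[n - 1])) {
      // The chunk ended in the middle of a pair: fetch the matching low surrogate.
      int next = in.read();
      if (next == -1 || !Character.isLowSurrogate((char) next)) {
        throw new IOException("Invalid surrogate pair");
      }
      chars[n++] = (char) next;
    }
    int count = 0;
    for (int i = 0; i < n; ) {
      char c = chars[i++];
      if (Character.isHighSurrogate(c)) {
        if (i == n || !Character.isLowSurrogate(chars[i])) {
          throw new IOException("Invalid surrogate pair");
        }
        out[count++] = Character.toCodePoint(c, chars[i++]);
      } else {
        // Lone low surrogates pass through unchanged, as in the single-value read().
        out[count++] = c;
      }
    }
    return count;
  }
}
```

Because at most out.length chars are requested before the optional extra read, the result never exceeds out.length code points, and nothing is read past what is returned.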

And maybe later extend this with tool objects like CodepointSource/Sink?

Issue Analytics

Created 3 years ago
Comments:6 (3 by maintainers)

Top GitHub Comments

cpovirk commented, Feb 22, 2021

As a general rule, we have tried to stay away from internationalization:

That’s not because it’s unimportant but rather because it’s too important to be handled by non-experts such as us.
Any real internationalization requires an ICU4J dependency. And, yes, I know that we’re already not dependency-free (and I know this has caused people pain), but we are trying to be deliberate about growing our scope in ways that would require runtime dependencies.

That said, we do occasionally try to provide a baseline level of not being gratuitously incompatible with internationalization, like how Strings.commonPrefix won’t break in the middle of a surrogate pair. But that’s hardly true internationalization, and I wonder sometimes if it’s actually worse for us to have “partial” Unicode handling than to have none at all.
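To make the surrogate-pair point concrete, here is a rough sketch of a common-prefix routine that avoids ending in the middle of a pair. This is not Guava’s source; the class SurrogateSafePrefix and the helper splitsPair are hypothetical names, and it only illustrates the idea of backing off one char when the cut would separate a high surrogate from its low surrogate.

```java
final class SurrogateSafePrefix {
  /** Longest common prefix of a and b that does not end between the two halves of a pair. */
  static String commonPrefix(CharSequence a, CharSequence b) {
    int max = Math.min(a.length(), b.length());
    int p = 0;
    while (p < max && a.charAt(p) == b.charAt(p)) {
      p++;
    }
    // If cutting at p would separate a high surrogate from its low surrogate
    // in either input, drop the dangling high surrogate from the prefix.
    if (splitsPair(a, p) || splitsPair(b, p)) {
      p--;
    }
    return a.subSequence(0, p).toString();
  }

  /** True if index p falls between a high surrogate and the low surrogate that follows it. */
  private static boolean splitsPair(CharSequence s, int p) {
    return p > 0
        && p < s.length()
        && Character.isHighSurrogate(s.charAt(p - 1))
        && Character.isLowSurrogate(s.charAt(p));
  }
}
```

Two strings that differ only in the low surrogate of their last character share the same leading high surrogate char, so a naive char-by-char prefix would end with half a character; the check above trims it.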

CodepointStream probably falls into that “partial” bucket. Probably it’s a little better than commonPrefix, though, since it at least lets users operate on the code points however they want, rather than directly splitting them up in a potentially incorrect way. I don’t think it will be a priority for us, but it doesn’t immediately strike me as so obviously out of scope that I feel obligated to close the issue entirely 😃

(FWIW we do have an internal API that uses BreakIterator to present an Iterable<String> view of an input text, broken by characters, lines, sentences, or words. I suspect that there are many cases in which such convenience wrappers around ICU4J could be helpful. (In fact, I can think of some wrappers for date-time formatting that we have, too.) That much is out of scope for Guava, but ideally such a thing would exist somewhere.)

jbduncan commented, Feb 20, 2021

Cool! You can use ICU4J’s BreakIterator to get the words and grapheme clusters in your strings.

(I don’t really understand why those methods have different versions that either do or don’t accept a locale. By comparison, Rust’s unicode-segmentation crate, which does the same thing as BreakIterator, doesn’t accept the Rust equivalent of a locale.)

However, BreakIterator is a bit hard to use, so you may have a better time if you either store the parsed characters/words in List<String>s after using it, or wrap usages of BreakIterator in a custom lazy Iterable<String>.
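A lazy wrapper of that kind might look roughly like the sketch below. To stay dependency-free it uses the JDK’s java.text.BreakIterator; ICU4J’s com.ibm.icu.text.BreakIterator exposes a similar setText/first/next API, but swapping it in is an assumption not tested here. The class name Segments and the method of are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class Segments {
  /** Lazily splits text at the boundaries reported by the given kind of BreakIterator. */
  static Iterable<String> of(String text, BreakIterator prototype) {
    return () -> {
      // Clone so each iteration has its own cursor and the caller's instance is untouched.
      BreakIterator boundaries = (BreakIterator) prototype.clone();
      boundaries.setText(text);
      return new Iterator<String>() {
        private int start = boundaries.first();
        private int end = boundaries.next();

        @Override public boolean hasNext() {
          return end != BreakIterator.DONE;
        }

        @Override public String next() {
          if (!hasNext()) {
            throw new NoSuchElementException();
          }
          String segment = text.substring(start, end);
          start = end;
          end = boundaries.next();
          return segment;
        }
      };
    };
  }
}
```

Passing BreakIterator.getCharacterInstance(locale) yields user-perceived characters, while getWordInstance(locale) yields words along with the whitespace and punctuation between them. Collecting the result into a List<String> gives the eager variant mentioned above.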