End offset for compatibility characters is not incremented with ICUNormalizer2CharFilter
Description
This issue comes from https://github.com/elastic/elasticsearch/issues/50008.
When tokenizing a compatibility character (e.g. ㋀) after applying the icu_normalizer char filter, the end offset of the compatibility character is not incremented correctly.
The following test, which I added to TestICUNormalizer2CharFilter, fails:
public void testTokenStreamCombiningCharacter() throws IOException {
  String input = "日日㋀日"; // ㋀ is the compatibility character
  CharFilter reader =
      new ICUNormalizer2CharFilter(
          new StringReader(input),
          Normalizer2.getInstance(null, "nfkc_cf", Normalizer2.Mode.COMPOSE));
  Tokenizer tokenStream =
      new ICUTokenizer(newAttributeFactory(), new DefaultICUTokenizerConfig(false, true));
  tokenStream.setReader(reader);
  assertTokenStreamContents(
      tokenStream,
      new String[] {"日", "日", "1", "月", "日"},
      new int[] {0, 1, 2, 3, 4}, // test passes if changed to {0, 1, 2, 2, 3}
      new int[] {1, 2, 3, 4, 5}, // test passes if changed to {1, 2, 2, 3, 4} (end offset of the token "1" is not incremented)
      input.length());
}
$ ./gradlew test --tests org.apache.lucene.analysis.icu.TestICUNormalizer2CharFilter.testTokenStreamCombiningCharacter
org.apache.lucene.analysis.icu.TestICUNormalizer2CharFilter > testTokenStreamCombiningCharacter FAILED
java.lang.AssertionError: endOffset 2 term=1 expected:<3> but was:<2>
Version and environment details
- macOS 12.3.1
- openjdk 17.0.5
Top GitHub Comments
Yes, normally composed vs. decomposed form (NFC vs. NFD) does not change tokenization, so you may normalize before or after tokenizing; it doesn't matter.
But compatibility characters like this don't really work well in Unicode text processing: they exist only for compatibility/round-tripping. You have to apply NFKC/NFKD first before you can really do anything with them. Maybe for now, normalize documents before you send them to Elasticsearch.
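For illustration, here is a minimal ICU4J sketch (my own example, not from the issue) showing what NFKC does to this character before any tokenization happens:

import com.ibm.icu.text.Normalizer2;

public class NfkcDemo {
  public static void main(String[] args) {
    // NFKC maps the compatibility character ㋀ (U+32C0) to the two characters "1月"
    Normalizer2 nfkc = Normalizer2.getNFKCInstance();
    System.out.println(nfkc.normalize("\u32C0")); // prints: 1月
  }
}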
I debugged the issue: the problem is not this particular char filter; it affects all char filters.
Think about this single-character string: "㋀". Our char filter turns it into two characters, "1" and "月". We would expect the offsets to look like this:

token "1": startOffset = 0, endOffset = 1
token "月": startOffset = 0, endOffset = 1

(both tokens map back to the single original character ㋀ at offsets 0..1)
As you can see, the bug is in the whole char filter API of correctOffset(): we need correctOffset(1) -> 1 for the end offset of the first token, but we need correctOffset(1) -> 0 for the start offset of the second token. I can't see any way to fix this without changing the actual char filter API (e.g. supporting two separate methods: correctStartOffset() and correctEndOffset()).
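A minimal sketch of what such a split API might look like (correctStartOffset()/correctEndOffset() do not exist in Lucene; this is purely hypothetical):

// Hypothetical sketch -- these methods do NOT exist in Lucene's CharFilter API.
// Separate start/end correction would disambiguate offsets inside an expansion.
abstract class HypotheticalCharFilter {
  // Maps an offset used as a token *start* back into the original input.
  // For "㋀" -> "1月": correctStartOffset(1) == 0 (the start of "月" maps to the start of ㋀).
  abstract int correctStartOffset(int currentOff);

  // Maps an offset used as a token *end* back into the original input.
  // For "㋀" -> "1月": correctEndOffset(1) == 1 (the end of "1" still covers all of ㋀).
  abstract int correctEndOffset(int currentOff);
}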
Sorry for the bad example/explanation. Another example would be a char filter that converts æ to ae: a's end offset of 1 needs to remain 1 after correction, but e's start offset of 1 needs to be corrected to 0.