
End offset for compatibility characters is not incremented with ICUNormalizer2CharFilter

See original GitHub issue

Description

This issue comes from https://github.com/elastic/elasticsearch/issues/50008. When tokenizing text containing compatibility characters (e.g. ㋀) after applying the char filter icu_normalizer, the end offset of the expanded character is not incremented correctly.

The following test, which I added to TestICUNormalizer2CharFilter, fails.

public void testTokenStreamCombiningCharacter() throws IOException {
  String input = "日日㋀日"; // ㋀ is a compatibility character (NFKC-decomposes to "1月")
  CharFilter reader =
      new ICUNormalizer2CharFilter(
          new StringReader(input),
          Normalizer2.getInstance(null, "nfkc_cf", Normalizer2.Mode.COMPOSE));

  Tokenizer tokenStream =
      new ICUTokenizer(newAttributeFactory(), new DefaultICUTokenizerConfig(false, true));
  tokenStream.setReader(reader);

  assertTokenStreamContents(
      tokenStream,
      new String[] {"日", "日", "1", "月", "日"},
      new int[] {0, 1, 2, 3, 4}, // test passes if changed to {0, 1, 2, 2, 3}
      new int[] {1, 2, 3, 4, 5}, // test passes if changed to {1, 2, 2, 3, 4} (end offset for the token `1` is not incremented)
      input.length());
}
$ ./gradlew test --tests org.apache.lucene.analysis.icu.TestICUNormalizer2CharFilter.testTokenStreamCombiningCharacter
org.apache.lucene.analysis.icu.TestICUNormalizer2CharFilter > testTokenStreamCombiningCharacter FAILED
    java.lang.AssertionError: endOffset 2 term=1 expected:<3> but was:<2>
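The offset mismatch stems from NFKC expanding ㋀ into two characters, so the 4-character input produces 5 tokens. As a quick sketch, the JDK's java.text.Normalizer (not the ICU4J Normalizer2 used by the filter, though both apply NFKC to this input) shows the expansion:

```java
import java.text.Normalizer;

public class NfkcExpansionDemo {
    public static void main(String[] args) {
        String input = "日日㋀日"; // 4 characters, one of them a compatibility character
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFKC);
        // ㋀ (U+32C0) has the compatibility decomposition "1月",
        // so normalization grows the string by one character.
        System.out.println(normalized);                                 // 日日1月日
        System.out.println(input.length() + " -> " + normalized.length()); // 4 -> 5
    }
}
```

Both output characters "1" and "月" map back to the single input character ㋀, which is exactly where the offset correction goes wrong below.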

Version and environment details

  • macOS 12.3.1
  • openjdk 17.0.5

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
rmuir commented, Nov 26, 2022

Yes, normally composed vs. decomposed (NFC vs. NFD) does not change tokenization, so you may normalize before or after tokenization; it doesn't matter.

But compatibility characters like this don't really work well in Unicode text processing: they exist mainly for compatibility/round-trip conversion. You have to apply NFKC/NFKD before you can really do anything with them. Maybe for now, normalize documents before you send them to Elasticsearch.

0 reactions
rmuir commented, Nov 27, 2022

I debugged the issue: the problem is not this particular charfilter; it impacts all charfilters.

Think about the single-character string “㋀”. Our charfilter turns it into two characters, “1” and “月”. We would expect the offsets to look like this:

first token "1" at rawStartOffset=0, rawEndOffset=1 -> startOffset=0, endOffset=1
  correctOffset(0) -> 0
  correctOffset(1) -> 1
second token "月" at rawStartOffset=1, rawEndOffset=2 -> startOffset=0, endOffset=1
  correctOffset(1) -> 0
  correctOffset(2) -> 1

As you can see, the bug is in the whole charfilter API of “correctOffset”: we need correctOffset(1) -> 1 for the end offset of the first token, but we need correctOffset(1) -> 0 for the start offset of the second token.

I can’t see any way to fix this without changing the actual charfilter API (e.g. supporting two separate methods: correctStartOffset() and correctEndOffset()).

Sorry for the bad example/explanation. Another example would be a charfilter that converts æ to ae: a’s end offset of 1 needs to remain 1 after correction, but e’s start offset of 1 needs to be corrected to 0.
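The conflict can be sketched in plain Java (a hypothetical illustration, not the Lucene API): for the æ -> ae expansion, the correction that output offset 1 needs depends on whether it is being used as a start offset or an end offset.

```java
import java.util.Map;

public class CorrectOffsetConflict {
    public static void main(String[] args) {
        // A hypothetical char filter expands the 1-char input "æ" into the
        // 2-char output "ae". Map each output offset back to an input offset:
        Map<Integer, Integer> neededAsEndOffset   = Map.of(0, 0, 1, 1, 2, 1);
        Map<Integer, Integer> neededAsStartOffset = Map.of(0, 0, 1, 0, 2, 1);

        // A single correctOffset(int) must pick one answer for offset 1, but
        // the two uses disagree: token "a" needs its end offset 1 -> 1, while
        // token "e" needs its start offset 1 -> 0.
        System.out.println("conflict at offset 1: "
                + !neededAsEndOffset.get(1).equals(neededAsStartOffset.get(1)));
    }
}
```

This is why a fix would seem to require two correction methods rather than one, as suggested above.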
