End offset for compatibility characters is not incremented with ICUNormalizer2CharFilter
Description
This issue comes from https://github.com/elastic/elasticsearch/issues/50008.
When tokenizing a compatibility character (e.g. ㋀) after applying the icu_normalizer char filter, the end offset of the compatibility character is not incremented correctly.
The following test, which I added to TestICUNormalizer2CharFilter, fails:
public void testTokenStreamCombiningCharacter() throws IOException {
  String input = "日日㋀日"; // ㋀ is the compatibility character
  CharFilter reader =
      new ICUNormalizer2CharFilter(
          new StringReader(input),
          Normalizer2.getInstance(null, "nfkc_cf", Normalizer2.Mode.COMPOSE));
  Tokenizer tokenStream =
      new ICUTokenizer(newAttributeFactory(), new DefaultICUTokenizerConfig(false, true));
  tokenStream.setReader(reader);
  assertTokenStreamContents(
      tokenStream,
      new String[] {"日", "日", "1", "月", "日"},
      new int[] {0, 1, 2, 3, 4}, // test passes if changed to {0, 1, 2, 2, 3}
      new int[] {1, 2, 3, 4, 5}, // test passes if changed to {1, 2, 2, 3, 4} (end offset of the token "1" is not incremented)
      input.length());
}
$ ./gradlew test --tests org.apache.lucene.analysis.icu.TestICUNormalizer2CharFilter.testTokenStreamCombiningCharacter
org.apache.lucene.analysis.icu.TestICUNormalizer2CharFilter > testTokenStreamCombiningCharacter FAILED
java.lang.AssertionError: endOffset 2 term=1 expected:<3> but was:<2>
Version and environment details
- macOS 12.3.1
- openjdk 17.0.5
Top GitHub Comments
Yes, normally composed vs. decomposed form (NFC vs. NFD) does not change tokenization, so you may normalize before or after tokenizing; it doesn't matter.
But compatibility characters like this don't really work well in Unicode text processing: they exist only for compatibility/round-tripping. You have to apply NFKC/NFKD first before you can really do anything with them. Maybe for now, normalize documents before you send them to Elasticsearch.
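For illustration, here is a minimal ICU4J sketch (my own example, not from the issue) showing what NFKC does to this character before any tokenization happens:

import com.ibm.icu.text.Normalizer2;

public class NfkcDemo {
  public static void main(String[] args) {
    // NFKC maps the compatibility character ㋀ (U+32C0) to the two characters "1月"
    Normalizer2 nfkc = Normalizer2.getNFKCInstance();
    System.out.println(nfkc.normalize("\u32C0")); // prints: 1月
  }
}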
I debugged the issue: the problem is not this particular char filter; it affects all char filters.
Think about this single-character string: "㋀". Our char filter turns it into two characters, "1" and "月". We would expect the offsets to look like this:

token "1": startOffset = 0, endOffset = 1
token "月": startOffset = 0, endOffset = 1

(both tokens map back to the single original character ㋀ at offsets 0..1)
As you can see, the bug is in the whole char filter API of correctOffset(): we need correctOffset(1) -> 1 for the end offset of the first token, but we need correctOffset(1) -> 0 for the start offset of the second token. I can't see any way to fix this without changing the actual char filter API (e.g. supporting two separate methods: correctStartOffset() and correctEndOffset()).
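A minimal sketch of what such a split API might look like (correctStartOffset()/correctEndOffset() do not exist in Lucene; this is purely hypothetical):

// Hypothetical sketch -- these methods do NOT exist in Lucene's CharFilter API.
// Separate start/end correction would disambiguate offsets inside an expansion.
abstract class HypotheticalCharFilter {
  // Maps an offset used as a token *start* back into the original input.
  // For "㋀" -> "1月": correctStartOffset(1) == 0 (the start of "月" maps to the start of ㋀).
  abstract int correctStartOffset(int currentOff);

  // Maps an offset used as a token *end* back into the original input.
  // For "㋀" -> "1月": correctEndOffset(1) == 1 (the end of "1" still covers all of ㋀).
  abstract int correctEndOffset(int currentOff);
}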
Sorry for the bad example/explanation. Another example would be a char filter that converts æ to ae: a's end offset of 1 needs to remain 1 after correction, but e's start offset of 1 needs to be corrected to 0.