One character is missing in class ASCIIFoldingFilter
I think one character is missing from the ASCIIFoldingFilter class.
Character: Ʀ, Nº: 422, UTF-16: 01A6
Source code that might need to be added to method FoldToASCII(char[] input, int inputPos, char[] output, int outputPos, int length):
case '\u01A6': // Ʀ [LATIN LETTER YR]
    output[outputPos++] = 'R';
    break;
Links about this character:
https://codepoints.net/U+01A6
https://en.wikipedia.org/wiki/%C6%A6
Issue Analytics
- State:
- Created 2 years ago
- Comments: 15 (7 by maintainers)
Thanks for the report.
As this is a line-by-line port from Java Lucene 4.8.0 (for the most part), we have faithfully reproduced the ASCIIFoldingFilter in its entirety. While we have admittedly included some patches from later versions of Lucene where they affect usability (for example, Lucene.Net.Analysis.Common all came from 4.8.1), the change you are suggesting isn't even reflected in the ASCIIFoldingFilter in the latest commit.
If you wish to pursue adding more characters to ASCIIFoldingFilter, I suggest you take it up with the Lucene design team on their dev mailing list.
However, do note this isn't the only filter included in the box that is capable of removing diacritics. Some alternatives:
Note that you can also create a custom folding filter by using a similar approach to the ICUFoldingFilter implementation (ported from Lucene 7.1.0). There is a tool you can port to generate a .nrm binary file from modified versions of these text files. The .nrm file can then be provided to the constructor of ICU4N.Text.Normalizer2 - more about the data format can be found in the ICU normalization docs. Note that the .nrm file is the same binary format used in C++ and Java.
Alternatively, if you wish to extend the ASCIIFoldingFilter with your own custom brew of characters, you can simply chain your own filter to ASCIIFoldingFilter as pointed out in this article.
FYI - there is also another demo showing additional ways to build analyzers here: https://github.com/NightOwl888/LuceneNetDemo
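For the chaining option, the core of such a custom filter is just a second pass over each term that handles whatever ASCIIFoldingFilter leaves alone. Below is a minimal sketch of that pass only, written with no Lucene.NET dependency so it can be run on its own. The class name ExtraFolding and its methods are hypothetical and not part of Lucene.NET; the wiring into a TokenFilter placed after ASCIIFoldingFilter is assumed and left out.

```csharp
using System;

// Hypothetical helper, not part of Lucene.NET. It sketches the extra
// character pass a custom filter chained after ASCIIFoldingFilter might apply.
public static class ExtraFolding
{
    // Replaces, in place, characters that the stock folding leaves untouched.
    // Only the first `length` characters of the buffer are examined.
    public static void FoldInPlace(char[] buffer, int length)
    {
        if (buffer == null) throw new ArgumentNullException(nameof(buffer));

        for (int i = 0; i < length; i++)
        {
            switch (buffer[i])
            {
                case '\u01A6': // Ʀ [LATIN LETTER YR]
                    buffer[i] = 'R';
                    break;
                // Further one-to-one mappings can be added here.
            }
        }
    }

    // Convenience wrapper for trying the mapping on a plain string.
    public static string Fold(string term)
    {
        char[] buffer = term.ToCharArray();
        FoldInPlace(buffer, buffer.Length);
        return new string(buffer);
    }
}
```

For example, ExtraFolding.Fold("\u01A6un") returns "Run". Because every mapping here is one character to one character, the term length never changes; a mapping that expands to several characters would need a separate output buffer, as FoldToASCII uses.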