
One character is missing in class ASCIIFoldingFilter


I think one character is missing from the ASCIIFoldingFilter class:

Character: Ʀ
Decimal: 422
UTF-16: 01A6

Source code that might need to be added to the method FoldToASCII(char[] input, int inputPos, char[] output, int outputPos, int length):

case '\u01A6': // Ʀ  [LATIN LETTER YR]
    output[outputPos++] = 'R';
    break;

Links about this character: https://codepoints.net/U+01A6
https://en.wikipedia.org/wiki/%C6%A6

Issue Analytics

Created 2 years ago
Comments: 15 (7 by maintainers)

Top GitHub Comments

NightOwl888 commented, Feb 12, 2022 (3 reactions)

Thanks for the report.

As this is a line-by-line port from Java Lucene 4.8.0 (for the most part), we have faithfully reproduced the ASCIIFoldingFilter in its entirety. While we have admittedly included some patches from later versions of Lucene where they affect usability (for example, Lucene.Net.Analysis.Common all came from 4.8.1), the change you are suggesting isn’t even reflected in the ASCIIFoldingFilter in the latest commit.

If you wish to pursue adding more characters to ASCIIFoldingFilter, I suggest you take it up with the Lucene design team on their dev mailing list.

However, do note this isn’t the only filter included in the box that is capable of removing diacritics from ASCII characters. Some alternatives:

Note that you can also create a custom folding filter by using a similar approach in the ICUFoldingFilter implementation (ported from Lucene 7.1.0). There is a tool you can port to generate a .nrm binary file from modified versions of these text files. The .nrm file can then be provided to the constructor of ICU4N.Text.Normalizer2 - more about the data format can be found in the ICU normalization docs. Note that the .nrm file is the same binary format used in C++ and Java.

Alternatively, if you wish to extend the ASCIIFoldingFilter with your own custom brew of characters, you can simply chain your own filter to ASCIIFoldingFilter as pointed out in this article.

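// Note: in Lucene.NET 4.8 the StandardTokenizer, StandardFilter, LowerCaseFilter
// and StopFilter constructors also take a LuceneVersion argument; it is omitted here.
public TokenStream GetTokenStream(string fieldName, TextReader reader)
{
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    // etc etc ...
    result = new StopFilter(result, yourSetOfStopWords);
    // Your own filter runs first, so it sees characters before ASCIIFoldingFilter does.
    result = new MyCustomFoldingFilter(result);
    result = new ASCIIFoldingFilter(result);
    return result;
}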
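
The core of a filter like MyCustomFoldingFilter is a character replacement over each token's text. The sketch below shows only that step in plain C#, without any Lucene.NET types. ExtraFolding, ExtraMappings, FoldInPlace and Fold are hypothetical names chosen for illustration, and the single mapping is the one proposed in this issue. It is one possible approach under those assumptions, not the library's implementation.

```csharp
using System;
using System.Collections.Generic;

public static class ExtraFolding
{
    // Hypothetical table of extra mappings to apply before ASCIIFoldingFilter runs.
    private static readonly Dictionary<char, char> ExtraMappings = new Dictionary<char, char>
    {
        { '\u01A6', 'R' }, // Ʀ [LATIN LETTER YR], the mapping proposed in this issue
    };

    // Replaces every mapped character in buffer[0..length) in place.
    public static void FoldInPlace(char[] buffer, int length)
    {
        for (int i = 0; i < length; i++)
        {
            if (ExtraMappings.TryGetValue(buffer[i], out char replacement))
            {
                buffer[i] = replacement;
            }
        }
    }

    // Convenience wrapper for trying the mapping on a whole string.
    public static string Fold(string text)
    {
        char[] buffer = text.ToCharArray();
        FoldInPlace(buffer, buffer.Length);
        return new string(buffer);
    }
}
```

A custom token filter could call something like FoldInPlace on each token's term buffer. Because the chain above lowercases before folding, the mapping table may also need the lowercased forms of any characters you add.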

NightOwl888 commented, Mar 3, 2022 (2 reactions)

FYI - there is also another demo showing additional ways to build analyzers here: https://github.com/NightOwl888/LuceneNetDemo
