One character is missing in class ASCIIFoldingFilter


I think one character is missing from class ASCIIFoldingFilter. Character: Ʀ, Nº: 422, UTF-16: 01A6

Source code that might need to be added to method FoldToASCII(char[] input, int inputPos, char[] output, int outputPos, int length):

case '\u01A6': // Ʀ  [LATIN LETTER YR]
    output[outputPos++] = 'R';
    break;

Links about this character: https://codepoints.net/U+01A6
https://en.wikipedia.org/wiki/%C6%A6

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 15 (7 by maintainers)

Top GitHub Comments

3 reactions
NightOwl888 commented, Feb 12, 2022

Thanks for the report.

As this is, for the most part, a line-by-line port from Java Lucene 4.8.0, we have faithfully reproduced the ASCIIFoldingFilter in its entirety. While we have admittedly included some patches from later versions of Lucene where they affect usability (for example, Lucene.Net.Analysis.Common all came from 4.8.1), the change you are suggesting isn’t reflected in the ASCIIFoldingFilter even in the latest Lucene commit.

If you wish to pursue adding more characters to ASCIIFoldingFilter, I suggest you take it up with the Lucene design team on their dev mailing list.

However, do note that this isn’t the only filter included in the box that is capable of removing diacritics from characters. Some alternatives:

  1. ICUNormalizer2Filter
  2. ICUFoldingFilter

Note that you can also create a custom folding filter by using an approach similar to the ICUFoldingFilter implementation (ported from Lucene 7.1.0). There is a tool you can port to generate a .nrm binary file from modified versions of these text files. The .nrm file can then be provided to the constructor of ICU4N.Text.Normalizer2; more about the data format can be found in the ICU normalization docs. Note that the .nrm file is the same binary format used in C++ and Java.

Alternatively, if you wish to extend the ASCIIFoldingFilter with your own custom brew of characters, you can simply chain your own filter to ASCIIFoldingFilter as pointed out in this article.

public TokenStream GetTokenStream(string fieldName, TextReader reader)
{
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    // etc etc ...
    result = new StopFilter(result, yourSetOfStopWords);
    result = new MyCustomFoldingFilter(result);
    result = new ASCIIFoldingFilter(result);
    return result;
}
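
The thread does not show what MyCustomFoldingFilter looks like, so here is a minimal sketch of one possible shape. It assumes the Lucene.NET 4.8 API (TokenFilter, ICharTermAttribute, and the protected m_input field); the class body is an illustration, not code from the maintainers. The single mapping, Ʀ to R, is the one proposed in this issue.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical filter: the name comes from the snippet above,
// the body is an illustration only.
public sealed class MyCustomFoldingFilter : TokenFilter
{
    private readonly ICharTermAttribute termAtt;

    public MyCustomFoldingFilter(TokenStream input)
        : base(input)
    {
        termAtt = AddAttribute<ICharTermAttribute>();
    }

    public override bool IncrementToken()
    {
        // Pull the next token from the upstream filter; stop when the stream is exhausted.
        if (!m_input.IncrementToken())
        {
            return false;
        }

        // Replace characters in place. A one-to-one mapping never changes the term length.
        char[] buffer = termAtt.Buffer;
        int length = termAtt.Length;
        for (int i = 0; i < length; i++)
        {
            switch (buffer[i])
            {
                case '\u01A6': // Ʀ [LATIN LETTER YR], the character from this issue
                    buffer[i] = 'R';
                    break;
            }
        }
        return true;
    }
}
```

Because each case maps one character to one character, the term buffer can be edited in place; a mapping that expands to several characters would also have to grow the buffer and update the term length. Note too that in the chain above LowerCaseFilter runs first, so depending on its case mapping the token may no longer contain U+01A6 by the time this filter sees it. Placing the custom filter ahead of LowerCaseFilter, or adding a case for the lowercased form, avoids that.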

2 reactions
NightOwl888 commented, Mar 3, 2022

FYI - there is also another demo showing additional ways to build analyzers here: https://github.com/NightOwl888/LuceneNetDemo

