Can't understand "word_delimiter" token filter
elasticsearch-dsl==6.2.1
elasticsearch==6.3.1
I have this analyzer:
from elasticsearch_dsl import analyzer, tokenizer

test_analyzer = analyzer(
    'test_analyzer',
    tokenizer=tokenizer('trigram', 'edge_ngram', min_gram=3, max_gram=10),
    filter=['lowercase', 'word_delimiter']
)
This analyzer is used in a Document:
from elasticsearch_dsl import Document, Integer, Text

class TestIndex(Document):
    name = Text(analyzer=test_analyzer)
    id = Integer()

    class Index:
        name = 'test-index'
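For reference, a minimal sketch of how such an index might be created and populated (the connection setup and the init()/save() calls follow the standard elasticsearch-dsl pattern; the local host is an assumption, and the sample name follows the pattern described below):

from elasticsearch_dsl import connections

connections.create_connection(hosts=['localhost'])  # assumed local cluster

TestIndex.init()  # creates 'test-index' with the analyzer in the mapping
doc = TestIndex(name='word1:word2:word3:word4', id=1)
doc.save()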
Objects in my index have names with the following pattern:
word1:word2:word3:word4
(there are a lot of ":" characters in each name)
As I understood the ES docs, with this analyzer I should be able to search for my objects by sub-words (like word2 from my example), but in practice the search only works when the query is almost the full name.
My Search request is:
from elasticsearch_dsl import Search

search = Search(
    index='test-index'
).query(
    "multi_match",
    query="word2",
    fields=['name'],
    fuzziness='AUTO'
)
(returns nothing)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
The problem is with your tokenizer, which produces just edge n-grams of length 3-10 from the start of the original input (word1:word2:word3:word4), which is not particularly useful I believe. I would recommend you play around with the _analyze API (0) to find an analyzer that does what you want; in this case I believe you want the simple_pattern_split tokenizer (1) instead. Hope this helps!

0 - https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
1 - https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepatternsplit-tokenizer.html
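To make that suggestion concrete, here is a hedged sketch of both experiments using the low-level client's _analyze API (the client setup and the ':' split pattern are assumptions; the token lists in the comments follow from the documented tokenizer behaviour, not from the original thread):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a reachable local cluster

# What the edge_ngram tokenizer from the question emits: only prefixes of
# the whole string, roughly 'wor', 'word', 'word1', 'word1:', ...,
# 'word1:word' - a standalone token 'word2' never appears.
print(es.indices.analyze(body={
    'tokenizer': {'type': 'edge_ngram', 'min_gram': 3, 'max_gram': 10},
    'text': 'word1:word2:word3:word4',
}))

# simple_pattern_split on ':' yields 'word1', 'word2', 'word3', 'word4',
# so every sub-word becomes a searchable token.
print(es.indices.analyze(body={
    'tokenizer': {'type': 'simple_pattern_split', 'pattern': ':'},
    'text': 'word1:word2:word3:word4',
}))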
Can you just print out the result instead of the assert? My guess would be that the code is accidentally run twice, so the document is duplicated. Or just print out the hits instead of just a count().
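A minimal sketch of what that inspection could look like, assuming the search object from the question (the attribute names follow the standard elasticsearch-dsl Response API):

response = search.execute()
print(response.hits.total)  # how many documents actually matched
for hit in response:
    print(hit.meta.id, hit.name)  # look for accidentally duplicated documents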