
Can't understand "word_delimiter" token filter


elasticsearch-dsl==6.2.1
elasticsearch==6.3.1

I have an analyzer:

from elasticsearch_dsl import analyzer, tokenizer

test_analyzer = analyzer(
    'test_analyzer',
    # edge n-grams of length 3-10 over the input
    tokenizer=tokenizer('trigram', 'edge_ngram', min_gram=3, max_gram=10),
    filter=['lowercase', 'word_delimiter']
)

This analyzer is used in a Document:

from elasticsearch_dsl import Document, Integer, Text

class TestIndex(Document):
    name = Text(analyzer=test_analyzer)
    id = Integer()

    class Index:
        name = 'test-index'
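
Not part of the original issue, but for completeness, a minimal sketch of how the index might be created and populated, assuming a cluster on localhost and the classes above:

from elasticsearch_dsl import connections

connections.create_connection(hosts=['localhost'])

TestIndex.init()  # create the index and mapping
TestIndex(id=1, name='word1:word2:word3:word4').save()
TestIndex._index.refresh()  # make the document visible to search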

Objects in my index follow this pattern in the name field:

word1:word2:word3:word4

(there are a lot of ":" separators in name)

As I understood the ES docs, with this analyzer I should be able to search my objects by sub-words (like word2 from my example), but in practice the search only matches when the query is almost the full name.

My Search request is:

from elasticsearch_dsl import Search

search = Search(
    index='test-index'
).query(
    "multi_match",
    query="word2",
    fields=['name'],
    fuzziness='AUTO'
)

(returns nothing)

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

2 reactions
honzakral commented, Nov 19, 2018

The problem is with your tokenizer, which produces just edge n-grams of length 3-10 from the original input (word1:word2:word3); in your case:

wor
word
word1
word1:
word1:w
word1:wo
word1:wor
word1:word

which is not particularly useful, I believe. I would recommend you play around with the _analyze API [0] to find an analyzer that does what you want; in this case I believe you want the simple_pattern_split tokenizer [1] instead:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import tokenizer

es = Elasticsearch()
es.indices.analyze(body={
    'text': 'word1:word2:word3',
    'tokenizer': tokenizer('split_words', 'simple_pattern_split', pattern=':').get_definition()
})

Hope this helps!

[0] https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepatternsplit-tokenizer.html
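
A possible next step (an editor's sketch, not part of the original thread) is to rebuild the analyzer around simple_pattern_split; the names name_analyzer and split_words below are hypothetical:

from elasticsearch_dsl import analyzer, tokenizer

name_analyzer = analyzer(
    'name_analyzer',
    # split on ':' so 'word1:word2:word3:word4' becomes word1, word2, word3, word4
    tokenizer=tokenizer('split_words', 'simple_pattern_split', pattern=':'),
    filter=['lowercase']
)

The name field would then need to be re-declared with this analyzer and the index deleted and re-created (e.g. via TestIndex.init()) before documents are reindexed.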

1 reaction
honzakral commented, Nov 19, 2018

Can you just print out the result instead of the assert? My guess would be that the code is accidentally run twice, so the document is duplicated. Or just print out the hits instead of just a count().
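
For illustration, printing the hits instead of a count might look like this (a sketch, assuming the search object defined in the question):

response = search.execute()
print(response.hits.total)  # number of matching documents
for hit in response:
    print(hit.meta.id, hit.name)  # document _id and the name field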

