
Can't understand "word_delimiter" token filter


elasticsearch-dsl==6.2.1
elasticsearch==6.3.1

I have an analyzer:

from elasticsearch_dsl import analyzer, tokenizer

test_analyzer = analyzer(
    'test_analyzer',
    # edge n-grams of length 3-10 over the input
    tokenizer=tokenizer('trigram', 'edge_ngram', min_gram=3, max_gram=10),
    filter=['lowercase', 'word_delimiter']
)

This analyzer is used in a Document:

from elasticsearch_dsl import Document, Integer, Text

class TestIndex(Document):
    name = Text(analyzer=test_analyzer)
    id = Integer()

    class Index:
        name = 'test-index'
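
Not part of the original issue, but for completeness, a minimal sketch of how the index might be created and populated, assuming a cluster on localhost and the classes above:

from elasticsearch_dsl import connections

connections.create_connection(hosts=['localhost'])

TestIndex.init()  # create the index and mapping
TestIndex(id=1, name='word1:word2:word3:word4').save()
TestIndex._index.refresh()  # make the document visible to search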

Objects in my index follow this pattern in the name field:

word1:word2:word3:word4

(there are a lot of ":" separators in name)

As I understood the ES docs, with this analyzer I should be able to search my objects by sub-words (like word2 from my example), but in practice the search only matches when the query is almost the full name.

My Search request is:

from elasticsearch_dsl import Search

search = Search(
    index='test-index'
).query(
    "multi_match",
    query="word2",
    fields=['name'],
    fuzziness='AUTO'
)

(returns nothing)

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

2 reactions
honzakral commented, Nov 19, 2018

The problem is with your tokenizer, which produces just edge n-grams of length 3-10 from the original input (word1:word2:word3); in your case:

wor
word
word1
word1:
word1:w
word1:wo
word1:wor
word1:word

which is not particularly useful, I believe. I would recommend you play around with the _analyze API [0] to find an analyzer that does what you want; in this case I believe you want the simple_pattern_split tokenizer [1] instead:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import tokenizer

es = Elasticsearch()
es.indices.analyze(body={
    'text': 'word1:word2:word3',
    'tokenizer': tokenizer('split_words', 'simple_pattern_split', pattern=':').get_definition()
})

Hope this helps!

[0] https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepatternsplit-tokenizer.html
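
A possible next step (an editor's sketch, not part of the original thread) is to rebuild the analyzer around simple_pattern_split; the names name_analyzer and split_words below are hypothetical:

from elasticsearch_dsl import analyzer, tokenizer

name_analyzer = analyzer(
    'name_analyzer',
    # split on ':' so 'word1:word2:word3:word4' becomes word1, word2, word3, word4
    tokenizer=tokenizer('split_words', 'simple_pattern_split', pattern=':'),
    filter=['lowercase']
)

The name field would then need to be re-declared with this analyzer and the index deleted and re-created (e.g. via TestIndex.init()) before documents are reindexed.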

1 reaction
honzakral commented, Nov 19, 2018

Can you just print out the result instead of the assert? My guess would be that the code is accidentally run twice, so the document is duplicated. Or just print out the hits instead of just a count().
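
For illustration, printing the hits instead of a count might look like this (a sketch, assuming the search object defined in the question):

response = search.execute()
print(response.hits.total)  # number of matching documents
for hit in response:
    print(hit.meta.id, hit.name)  # document _id and the name field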

