Adding a boost to a SearchField does nothing at query time in Elasticsearch

Issue Summary

The boost option as documented here does not increase a field’s relevance score in query responses from the built-in Elasticsearch functionality.

Field mappings are correctly assigned the specified boost level; however, this has no effect at query time because all SearchFields are added to the _all_text field via copy_to at index time. Since fields in a copy_to do not retain their boost values, the results are not weighted as expected.

Steps to Reproduce

  1. Start with a fresh Wagtail project using an Elasticsearch backend
  2. Create a test app and add the following to models.py:
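from django.db import models

from wagtail.core.models import Page
from wagtail.search import index


class MyPage(Page):
    lesser_field = models.CharField(max_length=32, blank=True)
    important_field = models.CharField(max_length=32, blank=True)

    # Both fields are indexed; the boosts should rank important_field matches higher.
    search_fields = Page.search_fields + [
        index.SearchField('lesser_field', boost=0.5),
        index.SearchField('important_field', boost=100),
    ]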
  3. Make migrations, apply migrations, update index
  4. Create two MyPage instances:
     MyPage(title="this should be lower", lesser_field="test")
     MyPage(title="this should be higher", important_field="test")
  5. Save the pages
  6. Run ./manage.py shell and enter the following:
from apps.test.models import MyPage
from wagtail.search.backends import get_search_backend

backend = get_search_backend()
results = backend.search("test", MyPage)

results then evaluates to: <SearchResults [<MyPage: this one should be lower>, <MyPage: this one should be higher>]>. Even though results.query_compiler.order_by_relevance is True, the page with the higher boost is not returned first.

For further confirmation, insert a breakpoint in wagtail.search.backends.elasticsearch2._do_search and view the response from ES:

{
  "took": 7,
  "timed_out": "False",
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 6.2325726,
    "hits": [
      {
        "_index": "wagtail__wagtailcore_page",
        "_type": "doc",
        "_id": "1285",
        "_score": 6.2325726,
        "fields": {
          "pk": [
            "2"
          ]
        }
      },
      {
        "_index": "wagtail__wagtailcore_page",
        "_type": "doc",
        "_id": "1286",
        "_score": 6.2325726,
        "fields": {
          "pk": [
            "3"
          ]
        }
      }
    ]
  }
}
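
A captured response can be checked for this symptom programmatically. The helper below is only a sketch: scores_are_tied is a hypothetical name, not part of Wagtail or the Elasticsearch client, and the response dict is a trimmed copy of the output above.

```python
import math


def scores_are_tied(response, rel_tol=1e-9):
    """Return True if every hit in an Elasticsearch response has the same _score."""
    scores = [hit["_score"] for hit in response["hits"]["hits"]]
    if not scores:
        return True
    return all(math.isclose(s, scores[0], rel_tol=rel_tol) for s in scores)


# Trimmed copy of the response shown above (only the keys the helper reads).
response = {
    "hits": {
        "total": 2,
        "max_score": 6.2325726,
        "hits": [
            {"_id": "1285", "_score": 6.2325726, "fields": {"pk": ["2"]}},
            {"_id": "1286", "_score": 6.2325726, "fields": {"pk": ["3"]}},
        ],
    }
}

print(scores_are_tied(response))
```

If the boosts were taking effect, the two hits would be expected to carry different scores and the helper would return False.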

The boost has had no effect on the query: both scores are identical. I believe this is because the compiled query searches the _all_text field, which does not store the boost value:

{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "test",
          "fields": [
            "_all_text",
            "_edgengrams"
          ]
        }
      },
      "filter": {
        "match": {
          "content_type": "test.MyPage"
        }
      }
    }
  }
}
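
For comparison, here is a sketch of a query that applies the boosts at query time by searching the individual fields with Elasticsearch's field^boost notation. This is not what Wagtail generates; build_boosted_query is a hypothetical helper, and the field names are taken from the index mapping for this test app.

```python
import json


def build_boosted_query(query_string, content_type, field_boosts):
    """Build a multi_match query body with a per-field query-time boost."""
    # "name^boost" is the multi_match syntax for boosting a single field.
    fields = [f"{name}^{boost}" for name, boost in field_boosts.items()]
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {"query": query_string, "fields": fields}
                },
                "filter": {"match": {"content_type": content_type}},
            }
        }
    }


query = build_boosted_query(
    "test",
    "test.MyPage",
    {
        "test_mypage__lesser_field": 0.5,
        "test_mypage__important_field": 100,
    },
)
print(json.dumps(query, indent=2))
```

Because the boost is attached to each field in the query itself, it does not depend on anything stored at index time, which is the part that copy_to discards.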

And, just to confirm that the boost is being correctly added to the field mapping:

curl -X GET 'localhost:9200/wagtail__wagtailcore_page/_mapping' | python3 -m json.tool

...
  "test_mypage__important_field": {
    "type": "text",
    "boost": 100.0,
    "copy_to": [
      "_all_text"
    ]
  },
  "test_mypage__lesser_field": {
    "type": "text",
    "boost": 0.5,
    "copy_to": [
      "_all_text"
    ]
  },
...

I don’t believe this is considered expected behavior, as the documentation says:

boost (int/float) - This allows you to set fields as being more important than others. Setting this to a high number on a field will cause pages with matches in that field to be ranked higher.

I would expect that adding a boost to a field would manipulate the relevance scoring accordingly in the default ES implementation.

  • I have confirmed that this issue can be reproduced as described on a fresh Wagtail project: yes

Technical details

  • Python version: 3.6.7
  • Django version: 2.2.2
  • Wagtail version: 2.5.1
  • Elasticsearch version: 6.3.0

Please let me know if you need any more information or clarification. Thanks!

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

3 reactions
kaedroho commented, Jul 10, 2019

For anyone who’s interested in this, it looks like Lucene has removed support for boosting in the normalisation factor: https://github.com/apache/lucene-solr/commit/8ed2b764ed4d4d5203b5df1e16fdc1ffd640322c#diff-c4fe039a2fc01f762ea596b25d4a6bc0

The reason for this is supposedly “poor precision” as the normalisation factors are stored as single byte floats: https://issues.apache.org/jira/browse/LUCENE-6819

I strongly disagree with this decision, though: it’s precise enough for our needs, as people usually boost things by 2x or 10x, not 1.01x. And nobody cares about how precise the actual scores are, as long as their documents are ranked in the order they expect…

So we now have to do much more work at query time, which will certainly impact search performance for the sake of getting some extra decimal points of precision that we don’t need.

Even PostgreSQL has ranking options for indexed terms (which has been enough for every use case I’ve come across), and Elasticsearch no longer has this at all. Way to go!

2 reactions
gasman commented, Jun 17, 2020

Just been stung by this too…

I’m not sure that @ncryer’s approach above will work, because if we’re running a query through Page.search('...'), then the list of fields we obtain through self.mapping.get_mapping() will only include the core Page fields, not fields defined on specific page subclasses (which do get copied into _all_text).

I think the best way around this would be to scan through all models to find all distinct boost values in use, then define an all_fields field for each one (_all_fields_1, _all_fields_1.5, _all_fields_2 or whatever) and set up copy_to appropriately to funnel data into the right bucket. Then, at query time, we query against those fields, with the corresponding query-time boost applied to each one. If I’m not mistaken, the Postgres backend has to do something very similar, so there’s probably shared code we can refactor out here.
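
The bucketing idea above can be sketched in plain Python as follows. This is an illustration only, not Wagtail code: bucket_field_name and plan_boost_buckets are hypothetical helpers, the _all_text_boost_ prefix is an assumed naming scheme, test_mypage__other_field is a made-up third field, and dots in the boost value are replaced with underscores on the assumption that a dotted name such as _all_fields_1.5 would be read by Elasticsearch as an object path.

```python
def bucket_field_name(boost):
    """Name the catch-all field that collects every source field with this boost."""
    # format(..., "g") renders 100.0 as "100" and 0.5 as "0.5".
    return "_all_text_boost_" + format(float(boost), "g").replace(".", "_")


def plan_boost_buckets(field_boosts):
    """Return (copy_to per source field, boosted bucket fields to query)."""
    copy_to = {}
    buckets = {}
    for field, boost in sorted(field_boosts.items()):
        bucket = bucket_field_name(boost)
        copy_to[field] = bucket          # index time: funnel the field into its bucket
        buckets[bucket] = float(boost)   # query time: one boost per bucket
    query_fields = [
        f"{bucket}^{format(boost, 'g')}" for bucket, boost in sorted(buckets.items())
    ]
    return copy_to, query_fields


copy_to, query_fields = plan_boost_buckets(
    {
        "test_mypage__lesser_field": 0.5,
        "test_mypage__important_field": 100,
        "test_mypage__other_field": 100,  # hypothetical field sharing a boost
    }
)
print(copy_to)
print(query_fields)
```

Fields that share a boost share a bucket, so the number of fields queried grows with the number of distinct boost values rather than with the number of indexed fields.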
