Adding a boost to a SearchField does nothing at query time in Elasticsearch
Issue Summary
The boost option as documented here does not increase a field’s relevance score in query responses from the built-in Elasticsearch functionality.
Field mappings are correctly assigned the specified boost level; however, this has no effect at query time because all SearchFields are added to the _all_text field via copy_to at index time. Since fields in a copy_to do not retain their boost values, the results are not weighted as expected.
Steps to Reproduce
- A fresh Wagtail project using an Elasticsearch backend
- Create a test app and add the following to models.py:
from django.db import models
from wagtail.core.models import Page
from wagtail.search import index

class MyPage(Page):
    lesser_field = models.CharField(max_length=32, blank=True)
    important_field = models.CharField(max_length=32, blank=True)
    search_fields = Page.search_fields + [
        index.SearchField('lesser_field', boost=0.5),
        index.SearchField('important_field', boost=100),
    ]
- Make migrations, apply migrations, update index
- Create two MyPage instances:
MyPage(title="this should be lower", lesser_field="test")
MyPage(title="this should be higher", important_field="test")
- Save the pages
- Open ./manage.py shell and insert the following:
from apps.test.models import MyPage
from wagtail.search.backends import get_search_backend
backend = get_search_backend()
results = backend.search("test", MyPage)
And results will return:
<SearchResults [<MyPage: this one should be lower>, <MyPage: this one should be higher>]>
Even though results.query_compiler.order_by_relevance is True, we are not getting the page with the higher boost returned first.
For further confirmation, insert a breakpoint in wagtail.search.backends.elasticsearch2._do_search and view the response from ES:
{
  "took": 7,
  "timed_out": "False",
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 6.2325726,
    "hits": [
      {
        "_index": "wagtail__wagtailcore_page",
        "_type": "doc",
        "_id": "1285",
        "_score": 6.2325726,
        "fields": {
          "pk": [
            "2"
          ]
        }
      },
      {
        "_index": "wagtail__wagtailcore_page",
        "_type": "doc",
        "_id": "1286",
        "_score": 6.2325726,
        "fields": {
          "pk": [
            "3"
          ]
        }
      }
    ]
  }
}
The boost option has had no effect on the query: both scores are the same. I believe this is because the compiled query searches the _all_text field, which does not store the boost value:
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "test",
          "fields": [
            "_all_text",
            "_edgengrams"
          ]
        }
      },
      "filter": {
        "match": {
          "content_type": "test.MyPage"
        }
      }
    }
  }
}
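For comparison, here is a minimal sketch of what a query with per-field, query-time boosts could look like. It relies on the "field^boost" syntax that Elasticsearch's multi_match accepts; the helper name build_boosted_query is hypothetical, and this is not how Wagtail currently compiles its queries.

```python
def build_boosted_query(query_string, field_boosts, content_type):
    """Build a bool query whose multi_match boosts each field at query time.

    Hypothetical helper for illustration only; not part of Wagtail.
    """
    # "name^boost" is the per-field boost syntax understood by multi_match.
    fields = [f"{name}^{boost}" for name, boost in field_boosts.items()]
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {"query": query_string, "fields": fields}
                },
                "filter": {"match": {"content_type": content_type}},
            }
        }
    }


# Field names are taken from the index mapping in this report.
body = build_boosted_query(
    "test",
    {"test_mypage__important_field": 100, "test_mypage__lesser_field": 0.5},
    "test.MyPage",
)
```

Querying the individual fields rather than _all_text is what lets the boosts take effect, at the cost of listing every searchable field in the query.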
And, just to confirm the boost is being correctly added to the field:
curl -X GET 'localhost:9200/wagtail__wagtailcore_page/_mapping' | python3 -m json.tool
...
"test_mypage__important_field": {
  "type": "text",
  "boost": 100.0,
  "copy_to": [
    "_all_text"
  ]
},
"test_mypage__lesser_field": {
  "type": "text",
  "boost": 0.5,
  "copy_to": [
    "_all_text"
  ]
},
...
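The same check can be done programmatically. The sketch below walks a "properties" dictionary shaped like the mapping excerpt above and lists every field that declares a boost but is also funnelled into another field via copy_to. The function name boosted_copy_to_fields and the extra "pk" entry are assumptions made for illustration.

```python
def boosted_copy_to_fields(properties):
    """Return (field, boost, copy_to targets) for fields that set both options.

    Hypothetical helper; `properties` is the per-field part of an index mapping.
    """
    found = []
    for name, spec in properties.items():
        targets = spec.get("copy_to")
        if "boost" in spec and targets:
            # copy_to may be a single field name or a list of names.
            if isinstance(targets, str):
                targets = [targets]
            found.append((name, spec["boost"], list(targets)))
    return found


# Excerpt modelled on the mapping output above; "pk" is an assumed extra field.
properties = {
    "test_mypage__important_field": {
        "type": "text", "boost": 100.0, "copy_to": ["_all_text"],
    },
    "test_mypage__lesser_field": {
        "type": "text", "boost": 0.5, "copy_to": ["_all_text"],
    },
    "pk": {"type": "keyword"},
}

for name, boost, targets in boosted_copy_to_fields(properties):
    print(f"{name}: boost={boost} copied to {targets}")
```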
I don’t believe this is considered expected behavior, as the documentation says:
boost (int/float) - This allows you to set fields as being more important than others. Setting this to a high number on a field will cause pages with matches in that field to be ranked higher.
I would expect that adding a boost to a field would manipulate the relevance scoring accordingly in the default ES implementation.
- I have confirmed that this issue can be reproduced as described on a fresh Wagtail project: yes
Technical details
- Python version: 3.6.7
- Django version: 2.2.2
- Wagtail version: 2.5.1
- Elasticsearch version: 6.3.0
Please let me know if you need any more information or clarification. Thanks!
Issue Analytics
- Created 4 years ago
- Comments: 9 (2 by maintainers)
Top GitHub Comments
For anyone who’s interested in this, it looks like Lucene has removed support for boosting in the normalisation factor: https://github.com/apache/lucene-solr/commit/8ed2b764ed4d4d5203b5df1e16fdc1ffd640322c#diff-c4fe039a2fc01f762ea596b25d4a6bc0
The reason for this is supposedly “poor precision”, as the normalisation factors are stored as single-byte floats: https://issues.apache.org/jira/browse/LUCENE-6819
I strongly disagree with this decision though. It’s precise enough for our needs, as people usually boost things by 2x or 10x, not 1.01x. And nobody cares about how precise the actual scores are, as long as their documents are ranked in the order they expect…
So we now have to do much more work at query time, which will certainly impact search performance, for the sake of getting some extra decimal points of precision that we don’t need.
Even PostgreSQL has ranking options for indexed terms (which has been enough for every use case I’ve come across), and Elasticsearch no longer has this at all. Way to go!
Just been stung by this too…
I’m not sure that @ncryer’s approach above will work, because if we’re running a query through Page.search('...'), then the list of fields we obtain through self.mapping.get_mapping() will only include the core Page fields, not fields defined on specific page subclasses (which do get copied into _all_fields).
I think the best way around this would be to scan through all models to find all distinct boost values in use, then define an all_fields field for each one (_all_fields_1, _all_fields_1.5, _all_fields_2 or whatever) and set up copy_to appropriately to funnel data into the right bucket. Then, at query time, we query against those fields, with the corresponding query-time boost applied to each one. If I’m not mistaken, the Postgres backend has to do something very similar, so there’s probably shared code we can refactor out here.
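A rough sketch of that bucketing idea, assuming we already have a flat dictionary of field names and their boosts collected from all models. The helper names bucket_name and plan_buckets are hypothetical, and the bucket naming simply follows the examples in the comment above.

```python
def bucket_name(boost):
    """Name of the catch-all field for one boost value (illustrative scheme)."""
    # The "g" format drops trailing zeros: 2 -> "2", 1.5 -> "1.5", 100.0 -> "100".
    return f"_all_fields_{boost:g}"


def plan_buckets(field_boosts):
    """Map each field to its copy_to bucket and list the fields to query.

    Returns (copy_to, query_fields): copy_to maps field name -> bucket field,
    and query_fields holds one "bucket^boost" entry per distinct boost value.
    """
    copy_to = {field: bucket_name(boost) for field, boost in field_boosts.items()}
    distinct_boosts = sorted(set(field_boosts.values()))
    query_fields = [f"{bucket_name(b)}^{b:g}" for b in distinct_boosts]
    return copy_to, query_fields


# Assumed input: field names from this issue plus a title field with boost 2.
copy_to, query_fields = plan_buckets({
    "title": 2,
    "test_mypage__important_field": 100,
    "test_mypage__lesser_field": 0.5,
})
```

The query_fields list could then be passed as the "fields" of a multi_match query. One caveat: Elasticsearch treats dots in field names as object paths, so a real implementation would probably want a different separator for non-integer boosts such as 1.5.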