Dense Passage Retriever Fails to run .eval(), elastic search document store with custom mapping
The Bug
I recently switched from a vanilla ElasticSearch Retriever to a Dense Passage Retriever, keeping the original ElasticSearch DocumentStore with its custom mapping. The Dense Passage Retriever works in the QA pipeline and for retrieve operations. The problem arises when I try to evaluate the retriever on its own, specifically in the document store call to query_by_embedding() inside eval().
The following is the line that caused the error:
Error message
In this function, the ElasticSearch object executes a search operation, which fails with the following error:
RequestError: RequestError(400, 'search_phase_execution_exception', 'runtime error')
Expected behavior
Expected the retriever to complete the evaluation successfully.
Additional context
Afterward, I ran a direct search query against Elasticsearch to verify that the problem does not originate from the Elasticsearch database itself; that search succeeded.
The Elasticsearch index uses a custom mapping, as follows:
pdf_custom_mapping = {
    "mappings": {
        "properties": {
            "chapter": {"type": "text"},
            "context": {
                "fields": {
                    "reverse": {"analyzer": "reverse", "type": "text"},
                    "trigram": {"analyzer": "trigram", "type": "text"}
                },
                "type": "text"
            },
            "page": {"type": "integer"},
            "pdf_code": {"type": "keyword"},
            "pdf_title": {"type": "text"},
            "pdf_url": {"type": "text"},
            "section": {"type": "text"},
            "title": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": 768}
        }
    },
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "reverse": {
                        "filter": ["lowercase", "reverse"],
                        "tokenizer": "standard",
                        "type": "custom"
                    },
                    "trigram": {
                        "filter": ["lowercase", "shingle"],
                        "tokenizer": "standard",
                        "type": "custom"
                    }
                },
                "filter": {
                    "shingle": {
                        "max_shingle_size": 3,
                        "min_shingle_size": 2,
                        "type": "shingle"
                    }
                }
            },
            "number_of_shards": 1
        }
    }
}
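Because the mapping declares its own embedding field, one quick sanity check is to confirm that it contains a dense_vector property whose dims match the retriever's output size (768 in the mapping above). The helper below is a minimal sketch in plain Python: find_dense_vector_fields is a hypothetical name, not part of Haystack or the Elasticsearch client, and it only inspects top-level properties of a trimmed-down copy of the mapping.

```python
def find_dense_vector_fields(mapping):
    """Return {field_name: dims} for every top-level dense_vector property.

    Hypothetical helper for illustration; it does not talk to Elasticsearch.
    """
    properties = mapping.get("mappings", {}).get("properties", {})
    return {
        name: spec.get("dims")
        for name, spec in properties.items()
        if spec.get("type") == "dense_vector"
    }


# Trimmed-down copy of the mapping above, kept small for illustration.
example_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "page": {"type": "integer"},
            "embedding": {"type": "dense_vector", "dims": 768},
        }
    }
}

vector_fields = find_dense_vector_fields(example_mapping)
print(vector_fields)
```

Running the same check against the full pdf_custom_mapping should report the single "embedding" field, which helps rule out a dimension mismatch before looking at the index contents.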
System:
- Haystack version (commit or version number): v0.5.0
- DocumentStore: ElasticSearch
- Retriever: Dense Passage Retriever
- GPU: Nvidia Tesla T4
- Notebook instance on Google Cloud
Issue Analytics
- Created 3 years ago
- Comments:12 (6 by maintainers)
Hi @tholor
Thank you so much for the help, the issue has been resolved. The problem was that the embeddings in the evaluation index weren't being updated as required.
Again, thanks for the help!
Hi @juan541,
Thanks for the script. This is super helpful! I could reproduce your bug and trace it down to two main reasons:
document_store.add_eval_data(filename=eval_filename, doc_index=evaluation_index_name, label_index=label_index_name)
=> You are writing your eval documents to an index called "eval_test". A few lines later you call update_embeddings() to create the embeddings. However, by default this call only updates the embeddings in your main document index (i.e. "document-small-test"). To update the embeddings in your eval index, just call document_store.update_embeddings(retriever, index=evaluation_index_name).
We now update the right index. However, the method will still fail because we've set embed_title=True when initializing the DPRRetriever and our eval docs don't contain any "title". Two options to fix this: 1) set embed_title=False, or 2) add a "title" field in your SQuAD-style JSON. It should then look like this: "data": [{"title": "my title", "paragraphs": [...
Here’s a code snippet to illustrate how it works. Hope this helps!
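The snippet mentioned above is not included in this copy of the thread. Separately, and not as a reconstruction of the maintainer's code, here is a minimal sketch of option 2 (adding a "title" to SQuAD-style eval data) in plain Python. The function name add_missing_titles, the placeholder title "untitled", and the sample eval_data are all assumptions made for illustration.

```python
import copy
import json


def add_missing_titles(squad, default_title="untitled"):
    """Return a copy of a SQuAD-style dict in which every article has a "title".

    Hypothetical helper; articles that already carry a title are left as-is.
    """
    patched = copy.deepcopy(squad)
    for article in patched.get("data", []):
        article.setdefault("title", default_title)
    return patched


# Made-up eval data: the first article has no "title" key.
eval_data = {
    "data": [
        {"paragraphs": [{"context": "Some passage text.", "qas": []}]},
        {"title": "my title", "paragraphs": []},
    ]
}

patched = add_missing_titles(eval_data)
print(json.dumps(patched, indent=2))
```

The patched dict can then be written back to the eval file before calling add_eval_data. The helper works on a deep copy, so the original data stays unchanged.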