Dense Passage Retriever Fails to run .eval(), elastic search document store with custom mapping

See original GitHub issue

The Bug

I recently switched from a vanilla ElasticSearch Retriever to a Dense Passage Retriever, keeping the original ElasticSearch DocumentStore with its custom mapping. The Dense Passage Retriever works within the QA pipeline and for retrieve operations. The problem arises when I try to evaluate the retriever on its own, specifically in the document store's query_by_embedding() call inside eval().

The following is the line that caused the error:

https://github.com/deepset-ai/haystack/blob/143da4cb3f374ba5cfdc5dd9beab888ec82c334d/haystack/document_store/elasticsearch.py#L560

Error message

In this function, the ElasticSearch client executes a search operation, which fails with the following error:

RequestError: RequestError(400, 'search_phase_execution_exception', 'runtime error')
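The top-level message ("runtime error") says little about the cause, but the response body that comes with a 400 usually carries more detail. The following is a minimal sketch for pulling the nested reasons out of such a body. It assumes the body has the usual Elasticsearch layout (an "error" object with "root_cause" and "failed_shards", each possibly nesting "caused_by"); the helper name failure_reasons is hypothetical and is not part of Haystack or the Elasticsearch client.

```python
from typing import Any, Dict, List


def failure_reasons(info: Dict[str, Any]) -> List[str]:
    """Collect "type: reason" strings from an Elasticsearch error body.

    Assumption: the body looks like a typical 400 response, where
    error.root_cause is a list of causes and every entry of
    error.failed_shards has a "reason" dict that may nest "caused_by".
    """
    reasons: List[str] = []

    def walk(cause: Any) -> None:
        # Follow the caused_by chain until it ends, skipping duplicates.
        while isinstance(cause, dict):
            line = f"{cause.get('type', '?')}: {cause.get('reason', '?')}"
            if line not in reasons:
                reasons.append(line)
            cause = cause.get("caused_by")

    error = info.get("error", {})
    if not isinstance(error, dict):
        # Some responses carry a plain string here; nothing to walk.
        return reasons
    for cause in error.get("root_cause", []):
        walk(cause)
    for shard in error.get("failed_shards", []):
        walk(shard.get("reason"))
    return reasons
```

With the 7.x Python client, the parsed body should be available as the info attribute of the caught RequestError, so failure_reasons(exc.info) can be printed inside an except block around the failing eval() call.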

Expected behavior

The retriever should complete the evaluation successfully.

Additional context

Afterward, I ran a search query directly against ElasticSearch to verify that the problem does not originate from the database itself. That search succeeded.

The ElasticSearch index uses a custom mapping, shown below:

pdf_custom_mapping = {
    "mappings": {
        "properties": {
            "chapter": {"type": "text"},
            "context": {
                "fields": {
                    "reverse": {"analyzer": "reverse", "type": "text"},
                    "trigram": {"analyzer": "trigram", "type": "text"}
                },
                "type": "text"
            },
            "page": {"type": "integer"},
            "pdf_code": {"type": "keyword"},
            "pdf_title": {"type": "text"},
            "pdf_url": {"type": "text"},
            "section": {"type": "text"},
            "title": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": 768}
        }
    },
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "reverse": {"filter": ["lowercase", "reverse"], "tokenizer": "standard", "type": "custom"},
                    "trigram": {"filter": ["lowercase", "shingle"], "tokenizer": "standard", "type": "custom"}
                },
                "filter": {
                    "shingle": {"max_shingle_size": 3, "min_shingle_size": 2, "type": "shingle"}
                }
            },
            "number_of_shards": 1
        }
    }
}

System:

  • Haystack version (commit or version number): v0.5.0
  • DocumentStore: ElasticSearch
  • Retriever: Dense Passage Retriever
  • GPU: Nvidia Tesla T4
  • Notebook instance on Google Cloud

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
juan541 commented, Dec 24, 2020

Hi @tholor

Thank you so much for the help, the issue has been resolved. The problem was that the embeddings in the evaluation index weren't being updated as required.

Again, thanks for the help!

1 reaction
tholor commented, Dec 24, 2020

Hi @juan541 ,

Thanks for the script. This is super helpful! I could reproduce your bug and trace it down to two main causes:

  1. document_store.add_eval_data(filename=eval_filename, doc_index=evaluation_index_name, label_index=label_index_name) => You are writing your eval documents to an index called "eval_test". A few lines later you call update_embeddings() to create the embeddings. However, by default this call only updates the embeddings in your "main document index" (i.e. "document-small-test"). To update the embeddings in your eval index, call document_store.update_embeddings(retriever, index=evaluation_index_name).

  2. We now update the right index. However, the method will still fail because we've set embed_title=True when initializing the DPRRetriever and our eval docs don't contain any "title". Two options to fix this: 1) set embed_title=False, or 2) add a "title" field to your SQuAD-style JSON. It should then look like this: "data": [{"title": "my title", "paragraphs": [...
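For the second option, the titles can be added programmatically instead of by hand. This is a minimal sketch that assumes a SQuAD-style dict like eval_docs, with a top-level "data" list. The helper name add_missing_titles and the placeholder title are hypothetical and not part of Haystack.

```python
import copy
from typing import Any, Dict


def add_missing_titles(squad: Dict[str, Any], default_title: str = "untitled") -> Dict[str, Any]:
    """Return a copy of a SQuAD-style dict in which every "data" entry has a "title".

    Existing titles are kept as they are; only entries without a "title"
    key receive the placeholder. The input dict is not modified.
    """
    patched = copy.deepcopy(squad)
    for entry in patched.get("data", []):
        entry.setdefault("title", default_title)
    return patched
```

Calling add_missing_titles(eval_docs, "my title") before writing the eval file would give every entry a title, so embed_title=True has something to embed. A placeholder title carries no real information, though, so setting embed_title=False may be the more honest choice when the documents have no titles at all.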

Here’s a code snippet to illustrate how it works. Hope this helps!

from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.dense import DensePassageRetriever
import json

eval_docs = {
  "data": [
    {
      "paragraphs": [
        {
          "context": "CHAPTER 4  LIVE LOADS",
          "qas": [
            {
              "answers": [
                {
                  "answer_start": 0,
                  "text": "CHAPTER 4 LIVE LOADS"
                }
              ],
              "id": "990efb3e-b698-41d1-be3c-d4cd43f8446b",
              "question": "Where do I find requirements for live loads?",
              "is_impossible": False
            }
          ]
        }
      ]
    }
  ]
}
pdf_custom_mapping = {
  "mappings": {
    "properties": {
      "chapter": {
        "type": "text"
      },
      "context": {
        "fields": {
          "reverse": {
            "analyzer": "reverse",
            "type": "text"
          },
          "trigram": {
            "analyzer": "trigram",
            "type": "text"
          }
        },
        "type": "text"
      },
      "page": {
        "type": "integer"
      },
      "pdf_code": {
        "type": "keyword"
      },
      "pdf_title": {
        "type": "text"
      },
      "pdf_url": {
        "type": "text"
      },
      "section": {
        "type": "text"
      },
      "title": {
        "type": "text"
      },
      "embedding": {
        "type": "dense_vector",
        "dims": 768
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "reverse": {
            "filter": [
              "lowercase",
              "reverse"
            ],
            "tokenizer": "standard",
            "type": "custom"
          },
          "trigram": {
            "filter": [
              "lowercase",
              "shingle"
            ],
            "tokenizer": "standard",
            "type": "custom"
          }
        },
        "filter": {
          "shingle": {
            "max_shingle_size": 3,
            "min_shingle_size": 2,
            "type": "shingle"
          }
        }
      },
      "number_of_shards": 1
    }
  }
}

evaluation_index_name = "eval_test"
eval_filename = "eval-file.json"
label_index_name = "label_test"
host = "localhost"
document_store = ElasticsearchDocumentStore(host=host, port=9200, index="document-small-test",
                                            username="some_user",
                                            search_fields=["title", "context"], embedding_field="embedding",
                                            name_field="title", text_field="context", excluded_meta_data=["embedding"],
                                            embedding_dim=768, custom_mapping=pdf_custom_mapping, timeout=10000)

# Write the SQuAD-style eval data to disk
with open(eval_filename, "w") as eval_file:
    json.dump(eval_docs, eval_file)

# Split your squad eval file into "docs" and "labels". Write docs to index "eval_test" and labels to "label_test"
document_store.add_eval_data(filename=eval_filename, doc_index=evaluation_index_name, label_index=label_index_name)
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  max_seq_len_query=64,
                                  max_seq_len_passage=256,
                                  batch_size=16,
                                  use_gpu=True,
                                  embed_title=False, # Your SQuAD docs have no "title" information. Either disable this parameter here or add titles to your eval docs.
                                  use_fast_tokenizers=True)

# Update the embeddings for all docs in the index "eval_test"(!)
document_store.update_embeddings(retriever, index=evaluation_index_name)

# Evaluate Retriever on its own
retriever_eval_results = retriever.eval(top_k=10, label_index=label_index_name, doc_index=evaluation_index_name)
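Since the root cause was an eval index whose documents had no embeddings, it can help to check for that before calling eval(). The sketch below only builds the query body for such a check. It assumes the embedding field is named "embedding" as in the mapping above; the helper name missing_field_query is hypothetical.

```python
from typing import Any, Dict


def missing_field_query(field: str = "embedding") -> Dict[str, Any]:
    """Build an Elasticsearch query body that matches documents lacking `field`."""
    return {"query": {"bool": {"must_not": [{"exists": {"field": field}}]}}}
```

With a 7.x Python client instance (here called es, an assumption), es.count(index="eval_test", body=missing_field_query())["count"] should return 0 once update_embeddings() has been run against the eval index.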

