Dense Passage Retriever Fails to run .eval(), elastic search document store with custom mapping

The Bug

I recently switched from a vanilla Elasticsearch retriever to a Dense Passage Retriever, keeping the original Elasticsearch DocumentStore with its custom mapping. The Dense Passage Retriever works in the QA pipeline and for retrieve operations. The problem arises when I try to evaluate the retriever on its own, specifically in the document store's query_by_embedding() call made from eval().

The following is the line that caused the error:

https://github.com/deepset-ai/haystack/blob/143da4cb3f374ba5cfdc5dd9beab888ec82c334d/haystack/document_store/elasticsearch.py#L560

Error message

In this function, the Elasticsearch client executes a search operation, which fails with the following error:

RequestError: RequestError(400, 'search_phase_execution_exception', 'runtime error')

Expected behavior

Expected the retriever to complete the evaluation successfully.

Additional context

To check that the problem does not originate from the Elasticsearch database itself, I ran a search query directly against Elasticsearch afterward, and it succeeded.

The Elasticsearch index uses the following custom mapping:

pdf_custom_mapping = {
    "mappings": {
        "properties": {
            "chapter": {"type": "text"},
            "context": {
                "fields": {
                    "reverse": {"analyzer": "reverse", "type": "text"},
                    "trigram": {"analyzer": "trigram", "type": "text"}
                },
                "type": "text"
            },
            "page": {"type": "integer"},
            "pdf_code": {"type": "keyword"},
            "pdf_title": {"type": "text"},
            "pdf_url": {"type": "text"},
            "section": {"type": "text"},
            "title": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": 768}
        }
    },
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "reverse": {
                        "filter": ["lowercase", "reverse"],
                        "tokenizer": "standard",
                        "type": "custom"
                    },
                    "trigram": {
                        "filter": ["lowercase", "shingle"],
                        "tokenizer": "standard",
                        "type": "custom"
                    }
                },
                "filter": {
                    "shingle": {
                        "max_shingle_size": 3,
                        "min_shingle_size": 2,
                        "type": "shingle"
                    }
                }
            },
            "number_of_shards": 1
        }
    }
}
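
A plain text search succeeding does not rule out a failure in the embedding search, because the two use different query types. The sketch below builds a script_score request of the general kind used for dense vector ranking. The helper name is hypothetical, and the exact script that Haystack sends is an assumption (the vector function and its argument syntax vary between Haystack and Elasticsearch versions). Elasticsearch raises an error when such a vector function runs on a document that has no value in the vector field, which is one plausible source of a 'runtime error' like the one above.

```python
from typing import Any, Dict, List


def build_embedding_query(
    query_vector: List[float],
    embedding_field: str = "embedding",
    top_k: int = 10,
) -> Dict[str, Any]:
    """Build a script_score search body that ranks all docs by cosine similarity.

    Hypothetical helper for illustration; not the body Haystack generates.
    """
    return {
        "size": top_k,
        "query": {
            "script_score": {
                # Score every document in the index ...
                "query": {"match_all": {}},
                "script": {
                    # ... by similarity to the query vector. The +1.0 offset keeps
                    # scores non-negative, which Elasticsearch requires.
                    "source": f"cosineSimilarity(params.query_vector, '{embedding_field}') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }
```

A body like this could be passed to the Elasticsearch client's search() call against the eval index to reproduce the failure outside Haystack.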

System:

Haystack version (commit or version number): v0.5.0
DocumentStore: ElasticSearch
Retriever: Dense Passage Retriever
GPU: Nvidia Tesla T4
Notebook instance on Google Cloud

Issue Analytics

Created 3 years ago
Comments: 12 (6 by maintainers)

Top GitHub Comments

juan541 commented, Dec 24, 2020

Hi @tholor

Thank you so much for the help, the issue has been resolved. The problem was that the embeddings in the evaluation index weren't being updated as required.

Again, thanks for the help!
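
One way to check for this condition before running eval() is to count the documents in the evaluation index that have no embedding. This is a minimal sketch: the helper name is hypothetical, and the index and field names are the ones used in this thread.

```python
from typing import Any, Dict


def missing_embedding_query(embedding_field: str = "embedding") -> Dict[str, Any]:
    """Build a query body matching documents that lack the embedding field."""
    return {
        "query": {
            "bool": {
                "must_not": {"exists": {"field": embedding_field}}
            }
        }
    }


# Usage against a running cluster (assumes the elasticsearch Python client, 7.x style):
#
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch(hosts=[{"host": "localhost", "port": 9200}])
#   missing = es.count(index="eval_test", body=missing_embedding_query())["count"]
#
# A non-zero count means some eval documents still need update_embeddings().
```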

tholor commented, Dec 24, 2020

Hi @juan541,

Thanks for the script. This is super helpful! I could reproduce your bug and trace it down to two main causes:

1. document_store.add_eval_data(filename=eval_filename, doc_index=evaluation_index_name, label_index=label_index_name) => You are writing your eval documents to an index called “eval_test”. A few lines later you call update_embeddings() to create the embeddings. However, by default this call only updates the embeddings in your “main document index” (i.e. “document-small-test”). To update the embeddings in your eval index, call document_store.update_embeddings(retriever, index=evaluation_index_name).
2. With that change we update the right index. However, the method will still fail because embed_title=True was set when initializing the DensePassageRetriever and the eval docs don’t contain any “title”. Two options to fix this: 1) set embed_title=False or 2) add a “title” field to your SQuAD-style JSON. It should then look like this: "data": [{"title": "my title", "paragraphs": [...
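
The second option (adding a “title” to the SQuAD-style JSON) could be sketched as below. The helper name and the placeholder title are hypothetical. Note that a placeholder title is embedded together with each passage when embed_title=True, so real titles are preferable where they exist.

```python
import copy
from typing import Any, Dict


def add_missing_titles(squad: Dict[str, Any], default_title: str = "untitled") -> Dict[str, Any]:
    """Return a copy of a SQuAD-style dict in which every "data" entry has a "title"."""
    patched = copy.deepcopy(squad)  # leave the caller's data untouched
    for entry in patched.get("data", []):
        # Only fill in a title when the entry does not already have one.
        entry.setdefault("title", default_title)
    return patched


example = {"data": [{"paragraphs": [{"context": "CHAPTER 4 LIVE LOADS", "qas": []}]}]}
patched_example = add_missing_titles(example, default_title="my title")
```

The patched dict can then be written to the eval file in place of the original before calling add_eval_data().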

Here’s a code snippet to illustrate how it works. Hope this helps!

from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.dense import DensePassageRetriever
import json

eval_docs = {
  "data": [
    {
      "paragraphs": [
        {
          "context": "CHAPTER 4  LIVE LOADS",
          "qas": [
            {
              "answers": [
                {
                  "answer_start": 0,
                  "text": "CHAPTER 4 LIVE LOADS"
                }
              ],
              "id": "990efb3e-b698-41d1-be3c-d4cd43f8446b",
              "question": "Where do I find requirements for live loads?",
              "is_impossible": False
            }
          ]
        }
      ]
    }
  ]
}
pdf_custom_mapping = {
  "mappings": {
    "properties": {
      "chapter": {
        "type": "text"
      },
      "context": {
        "fields": {
          "reverse": {
            "analyzer": "reverse",
            "type": "text"
          },
          "trigram": {
            "analyzer": "trigram",
            "type": "text"
          }
        },
        "type": "text"
      },
      "page": {
        "type": "integer"
      },
      "pdf_code": {
        "type": "keyword"
      },
      "pdf_title": {
        "type": "text"
      },
      "pdf_url": {
        "type": "text"
      },
      "section": {
        "type": "text"
      },
      "title": {
        "type": "text"
      },
      "embedding": {
        "type": "dense_vector",
        "dims": 768
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "reverse": {
            "filter": [
              "lowercase",
              "reverse"
            ],
            "tokenizer": "standard",
            "type": "custom"
          },
          "trigram": {
            "filter": [
              "lowercase",
              "shingle"
            ],
            "tokenizer": "standard",
            "type": "custom"
          }
        },
        "filter": {
          "shingle": {
            "max_shingle_size": 3,
            "min_shingle_size": 2,
            "type": "shingle"
          }
        }
      },
      "number_of_shards": 1
    }
  }
}

evaluation_index_name = 'eval_test'
eval_filename = 'eval-file.json'
label_index_name = 'label_test'
host = "localhost"
document_store = ElasticsearchDocumentStore(
    host=host, port=9200, index="document-small-test",
    username="some_user",
    search_fields=["title", "context"], embedding_field="embedding",
    name_field="title", text_field="context", excluded_meta_data=["embedding"],
    embedding_dim=768, custom_mapping=pdf_custom_mapping, timeout=10000,
)

# Create the evaluation file
with open(eval_filename, "w") as eval_file:
    json.dump(eval_docs, eval_file)

# Split your squad eval file into "docs" and "labels". Write docs to index "eval_test" and labels to "label_test"
document_store.add_eval_data(filename=eval_filename, doc_index=evaluation_index_name, label_index=label_index_name)
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  max_seq_len_query=64,
                                  max_seq_len_passage=256,
                                  batch_size=16,
                                  use_gpu=True,
                                  embed_title=False, # Your SQuAD docs have no "title" information. Either disable this parameter here or add titles to your eval docs.
                                  use_fast_tokenizers=True)

# Update the embeddings for all docs in the index "eval_test"(!)
document_store.update_embeddings(retriever, index=evaluation_index_name)

# Evaluate Retriever on its own
retriever_eval_results = retriever.eval(top_k=10, label_index=label_index_name, doc_index=evaluation_index_name)
