Unable to do Dense search against own index

My environment:

  • OS - Ubuntu 18.04
  • Java 11.0.11
  • Python 3.8.8
  • Python Package versions:
    • torch 1.8.1
    • faiss-cpu 1.7.0
    • pyserini 0.12.0

Problem 1

I followed the instructions to create my own minimal index and was able to run the Sparse Retrieval example successfully. However, when I tried to run the Dense Retrieval example using the TctColBertQueryEncoder, I ran into issues that seem to be caused by my having a newer version of the transformers library, in which the requires_faiss and requires_pytorch functions have been replaced with a more general requires_backends function in transformers.file_utils. The following files were affected:

pyserini/dsearch/_dsearcher.py
pyserini/dsearch/_model.py

Problem 2

Replacing those calls in place in the Pyserini code in my site-packages allowed me to move forward, but now I get this error message:

RuntimeError: Error in faiss::FileIOReader::FileIOReader(const char*) at /__w/faiss-wheels/faiss-wheels/faiss/faiss/impl/io.cpp:81: Error: 'f' failed: could not open /path/to/lucene_index/index for reading: No such file or directory

The /path/to/lucene_index above is the folder where my Lucene index was built using pyserini.index. I am guessing that an additional ANN index needs to be built from the data before Dense searching can work? I looked at the help for pyserini.index, but nothing there indicated how to create an ANN index.

I can live with the first problem (since I have a local solution) but obviously some fix to that would be nice. For the second problem, some documentation or help with building a local index for dense searching will be very much appreciated.

Thanks!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
lintool commented, Jun 18, 2021

Thanks for the code snippets!

1 reaction
sujitpal commented, Jun 18, 2021

Just wanted to thank @lintool and @MXueguang for the instructions. I was able to create the FAISS sidecar index (docid + index) and use the Sparse, Dense and Hybrid retrieval mechanisms. I am sharing the code here in case it is useful to add to the documentation and/or for others who have the same requirement as mine. The first set of code blocks uses a pre-encoded set of queries in the pickled embedding.pkl file; the next set encodes the queries on the fly using the model and matches them against the sidecar FAISS index.
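
The thread does not show how the sidecar index itself was built. The following is a minimal sketch under stated assumptions, not the procedure from the maintainers' instructions: it assumes the dense searcher reads a FAISS index from a file named index (the name that appears in the error message above) and a plain-text file named docid with one document ID per line, both in the same folder. The flat inner-product index type and the helper names write_docids and build_sidecar_index are assumptions made for illustration.

```python
import os


def write_docids(index_dir, docids):
    """Write one docid per line to <index_dir>/docid, in embedding order."""
    os.makedirs(index_dir, exist_ok=True)
    path = os.path.join(index_dir, "docid")
    with open(path, "w", encoding="utf-8") as f:
        for docid in docids:
            f.write(f"{docid}\n")
    return path


def build_sidecar_index(index_dir, docids, embeddings):
    """Write a flat FAISS index to <index_dir>/index plus the docid file.

    `embeddings` is a 2-D array-like with one row per entry in `docids`.
    """
    # Imported lazily so write_docids works without these packages installed.
    import faiss
    import numpy as np

    docids = list(docids)
    vectors = np.ascontiguousarray(embeddings, dtype="float32")
    if vectors.ndim != 2 or vectors.shape[0] != len(docids):
        raise ValueError("expected one embedding row per docid")

    os.makedirs(index_dir, exist_ok=True)
    # Assumption: exact inner-product search; other FAISS index types may fit better.
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    faiss.write_index(index, os.path.join(index_dir, "index"))
    return write_docids(index_dir, docids)
```

The document embeddings would need to come from the same model that encodes the queries, for example the SentenceTransformer used in the custom-encoder snippets below; otherwise the query and document vectors are not comparable.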

sparse retrieval (baseline, no change)

from pyserini.search import SimpleSearcher

searcher = SimpleSearcher("../data/indexes/cord19_local_idx")
hits = searcher.search("coronavirus origin")
# Print up to 10 hits; avoids an IndexError when fewer are returned.
for i in range(min(10, len(hits))):
    print(i, hits[i].docid, hits[i].score)

dense retrieval with pre-encoded queries

from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder

# encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')
encoder = QueryEncoder(encoded_query_dir="../data/query-embeddings")
searcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx",
                               encoder)
hits = searcher.search("coronavirus origin")

# Print up to 10 hits; avoids an IndexError when fewer are returned.
for i in range(min(10, len(hits))):
    print(i, hits[i].docid, hits[i].score)

hybrid retrieval with pre-encoded queries

from pyserini.search import SimpleSearcher
from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder
from pyserini.hsearch import HybridSearcher

ssearcher = SimpleSearcher("../data/indexes/cord19_local_idx")
encoder = QueryEncoder(encoded_query_dir="../data/query-embeddings")
dsearcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx",
                                encoder)
hsearcher = HybridSearcher(dsearcher, ssearcher)
hits = hsearcher.search('coronavirus origin')

# Print up to 10 hits; avoids an IndexError when fewer are returned.
for i in range(min(10, len(hits))):
    print(i, hits[i].docid, hits[i].score)

dense retrieval with custom query encoder, no pre-encoding

from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder
from sentence_transformers import SentenceTransformer

class CustomQueryEncoder(QueryEncoder):
    def __init__(self, model):
        self.has_model = True
        self.model = model

    def encode(self, query: str):
        return self.model.encode([query])[0]


model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
encoder = CustomQueryEncoder(model)
searcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx",
                               encoder)
hits = searcher.search("coronavirus origin")

# Print up to 10 hits; avoids an IndexError when fewer are returned.
for i in range(min(10, len(hits))):
    print(i, hits[i].docid, hits[i].score)

hybrid retrieval with custom query encoder, no pre-encoding

from pyserini.search import SimpleSearcher
from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder
from pyserini.hsearch import HybridSearcher
from sentence_transformers import SentenceTransformer


class CustomQueryEncoder(QueryEncoder):
    def __init__(self, model):
        self.has_model = True
        self.model = model

    def encode(self, query: str):
        return self.model.encode([query])[0]


model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
ssearcher = SimpleSearcher("../data/indexes/cord19_local_idx")
encoder = CustomQueryEncoder(model)
dsearcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx",
                                encoder)
hsearcher = HybridSearcher(dsearcher, ssearcher)
hits = hsearcher.search('coronavirus origin')

# Print up to 10 hits; avoids an IndexError when fewer are returned.
for i in range(min(10, len(hits))):
    print(i, hits[i].docid, hits[i].score)