Unable to do Dense search against own index

My environment:

OS - Ubuntu 18.04
Java 11.0.11
Python 3.8.8
Python Package versions:
- torch 1.8.1
- faiss-cpu 1.7.0
- pyserini 0.12.0

Problem 1

I followed the instructions to create my own minimal index and was able to run the Sparse Retrieval example successfully. However, when I tried to run the Dense retrieval example using the TctColBertQueryEncoder, I hit errors that seem to be caused by my having a newer version of the transformers library, in which the requires_faiss and requires_pytorch functions have been replaced by a more general requires_backends function in transformers.file_utils. The following files were affected:

pyserini/dsearch/_dsearcher.py
pyserini/dsearch/_model.py
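
One way to avoid hand-editing each call site is a small compatibility shim that prefers the old helpers and falls back to requires_backends when they are missing. This is only a minimal sketch, not the fix adopted by Pyserini. It assumes requires_backends(obj, backends) can be imported from transformers.file_utils (or from transformers.utils in later releases), and the wrapper names simply mirror the removed helpers.

```python
"""Compatibility shim for the removed requires_faiss / requires_pytorch helpers."""

try:
    # Older transformers releases still ship the dedicated helpers.
    from transformers.file_utils import requires_faiss, requires_pytorch
except ImportError:
    try:
        # Newer releases expose a single general-purpose helper instead.
        from transformers.file_utils import requires_backends
    except ImportError:
        try:
            # Assumption: later releases moved the helper to transformers.utils.
            from transformers.utils import requires_backends
        except ImportError:
            requires_backends = None  # transformers missing or layout unknown

    def _require(obj, backend):
        """Delegate to requires_backends, or fail clearly if it is unavailable."""
        if requires_backends is None:
            raise ImportError("transformers does not provide requires_backends")
        requires_backends(obj, [backend])

    def requires_faiss(obj):
        _require(obj, "faiss")

    def requires_pytorch(obj):
        _require(obj, "torch")
```

With something like this saved as a local module, the two affected files would import the helpers from it rather than from transformers.file_utils.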

Problem 2

Replacing them in place in the Pyserini code in my site-packages allowed me to move forward, but now I get the error message:

RuntimeError: Error in faiss::FileIOReader::FileIOReader(const char*) at /__w/faiss-wheels/faiss-wheels/faiss/faiss/impl/io.cpp:81: Error: 'f' failed: could not open /path/to/lucene_index/index for reading: No such file or directory

The /path/to/lucene_index above is the folder where my Lucene index was built using pyserini.index. My guess is that an additional ANN index has to be built from the data before Dense searching can work? I looked at the help for pyserini.index, but nothing there seemed to cover creating an ANN index.
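
The error shows the dense searcher trying to open a file named index inside the directory it was given, and the follow-up comment later in this thread describes the FAISS files as docid + index. A quick check along those lines can tell a Lucene-only directory apart from one that also holds the FAISS files. This is a minimal sketch: missing_dense_files is a hypothetical helper, not part of Pyserini, and the two file names are assumptions taken from the error message and that comment.

```python
import os

# File names taken from the error message ("index") and the follow-up
# comment ("docid + index"); treat them as assumptions for other versions.
EXPECTED_DENSE_FILES = ("index", "docid")


def missing_dense_files(index_dir):
    """Return the expected FAISS files that are absent from index_dir."""
    return [
        name
        for name in EXPECTED_DENSE_FILES
        if not os.path.isfile(os.path.join(index_dir, name))
    ]
```

An empty list means both files are present. A directory that holds only Lucene segment files would be expected to report both names as missing.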

I can live with the first problem (since I have a local workaround), but a proper fix would obviously be nice. For the second problem, some documentation or help with building a local index for dense searching would be very much appreciated.

Thanks!

Top GitHub Comments

lintool commented, Jun 18, 2021

Thanks for the code snippets!

sujitpal commented, Jun 18, 2021

Just wanted to thank @lintool and @MXueguang for the instructions. I was able to create the FAISS sidecar index (docid + index) and use the Sparse, Dense and Hybrid retrieval mechanisms. Sharing code here in case it is useful to add to the documentation and/or for others who have the same requirement as mine. The first set of code blocks uses a pre-encoded set of queries from the pickled embedding.pkl file; the next set encodes the queries on the fly using the model and matches them against the sidecar FAISS index.

sparse retrieval (baseline, no change)

from pyserini.search import SimpleSearcher

searcher = SimpleSearcher("../data/indexes/cord19_local_idx")
hits = searcher.search("coronavirus origin")
# Print up to the top 10 hits; a small index may return fewer.
for i in range(min(10, len(hits))):
    print(i, hits[i].docid, hits[i].score)

dense retrieval with pre-encoded queries

from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder

# encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')
encoder = QueryEncoder(encoded_query_dir="../data/query-embeddings")
searcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx",
                               encoder)
hits = searcher.search("coronavirus origin")

# Print up to the top 10 hits; a small index may return fewer.
for i in range(min(10, len(hits))):
    print(i, hits[i].docid, hits[i].score)

hybrid retrieval with pre-encoded queries

from pyserini.search import SimpleSearcher
from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder
from pyserini.hsearch import HybridSearcher

ssearcher = SimpleSearcher("../data/indexes/cord19_local_idx")
encoder = QueryEncoder(encoded_query_dir="../data/query-embeddings")
dsearcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx",
                                encoder)
hsearcher = HybridSearcher(dsearcher, ssearcher)
hits = hsearcher.search('coronavirus origin')

# Print up to the top 10 hits; a small index may return fewer.
for i in range(min(10, len(hits))):
    print(i, hits[i].docid, hits[i].score)

dense retrieval with custom query encoder, no pre-encoding

from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder
from sentence_transformers import SentenceTransformer

class CustomQueryEncoder(QueryEncoder):
    def __init__(self, model):
        self.has_model = True
        self.model = model

    def encode(self, query: str):
        return self.model.encode([query])[0]


model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
encoder = CustomQueryEncoder(model)
searcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx",
                               encoder)
hits = searcher.search("coronavirus origin")

# Print up to the top 10 hits; a small index may return fewer.
for i in range(min(10, len(hits))):
    print(i, hits[i].docid, hits[i].score)

hybrid retrieval with custom query encoder, no pre-encoding

from pyserini.search import SimpleSearcher
from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder
from pyserini.hsearch import HybridSearcher
from sentence_transformers import SentenceTransformer


class CustomQueryEncoder(QueryEncoder):
    def __init__(self, model):
        self.has_model = True
        self.model = model

    def encode(self, query: str):
        return self.model.encode([query])[0]


model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
ssearcher = SimpleSearcher("../data/indexes/cord19_local_idx")
encoder = CustomQueryEncoder(model)
dsearcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx",
                                encoder)
hsearcher = HybridSearcher(dsearcher, ssearcher)
hits = hsearcher.search('coronavirus origin')

# Print up to the top 10 hits; a small index may return fewer.
for i in range(min(10, len(hits))):
    print(i, hits[i].docid, hits[i].score)
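
The snippets above assume the FAISS sidecar files already exist in the index directory; the thread does not show how they were produced. The following is a minimal sketch of one way to write them. It assumes that document embeddings are already available as a (num_docs, dim) array, that the searcher reads a FAISS index from a file named index and one document ID per line from a file named docid, and that a flat inner-product index is acceptable. write_docids and write_faiss_index are hypothetical helper names, not Pyserini APIs.

```python
import os


def write_docids(index_dir, docids):
    """Write one document ID per line, in the same order as the vectors."""
    os.makedirs(index_dir, exist_ok=True)
    path = os.path.join(index_dir, "docid")
    with open(path, "w", encoding="utf-8") as f:
        for docid in docids:
            f.write(f"{docid}\n")
    return path


def write_faiss_index(index_dir, vectors):
    """Write a flat inner-product FAISS index for a (num_docs, dim) array."""
    # Imported lazily so the docid helper works without these packages.
    import faiss
    import numpy as np

    # FAISS expects a contiguous float32 matrix.
    matrix = np.ascontiguousarray(vectors, dtype="float32")
    index = faiss.IndexFlatIP(matrix.shape[1])
    index.add(matrix)
    os.makedirs(index_dir, exist_ok=True)
    path = os.path.join(index_dir, "index")
    faiss.write_index(index, path)
    return path
```

The document vectors would need to come from the same model as the query encoder so that the dimensions and the embedding space match, and the docid order has to follow the row order of the vectors.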