Unable to do Dense search against own index
My environment:
- OS: Ubuntu 18.04
- Java 11.0.11
- Python 3.8.8
- Python package versions:
  - torch 1.8.1
  - faiss-cpu 1.7.0
  - pyserini 0.12.0
Problem 1
I followed the instructions to create my own minimal index and was able to run the sparse retrieval example successfully. However, when I tried to run the dense retrieval example using the `TctColBertQueryEncoder`, I ran into issues that seem to be caused by my having a newer version of the transformers library, in which the `requires_faiss` and `requires_pytorch` methods have been replaced by a more general `requires_backends` method in `transformers.file_utils`. The following files were affected:
- `pyserini/dsearch/_dsearcher.py`
- `pyserini/dsearch/_model.py`
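For reference, a minimal compatibility shim along the lines of the local fix might look like this. This is only a sketch under my assumptions: the exact transformers version boundary and the shim placement are mine, not something from the Pyserini source.

```python
# Compatibility sketch: newer transformers releases removed requires_faiss /
# requires_pytorch in favor of the more general requires_backends.
try:
    # Older transformers: the per-backend helpers still exist.
    from transformers.file_utils import requires_faiss, requires_pytorch
except ImportError:
    # Newer transformers: reconstruct them on top of requires_backends.
    from transformers.file_utils import requires_backends

    def requires_faiss(obj):
        requires_backends(obj, "faiss")

    def requires_pytorch(obj):
        requires_backends(obj, "torch")
```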
Problem 2
Replacing them in place in the Pyserini code in my `site-packages` allowed me to move forward, but now I get the error message:
```
RuntimeError: Error in faiss::FileIOReader::FileIOReader(const char*) at /__w/faiss-wheels/faiss-wheels/faiss/faiss/impl/io.cpp:81: Error: 'f' failed: could not open /path/to/lucene_index/index for reading: No such file or directory
```
The `/path/to/lucene_index` above is the folder where my Lucene index was built using `pyserini.index`. I am guessing that an additional ANN index might need to be built from the data to allow dense searching to happen? I looked in the help for `pyserini.index`, but there did not seem to be anything that indicated creation of an ANN index.
I can live with the first problem (since I have a local solution), but obviously a proper fix would be nice. For the second problem, some documentation or help with building a local index for dense searching would be very much appreciated.
Thanks!
Issue Analytics
- State:
- Created: 2 years ago
- Comments: 12 (6 by maintainers)
Top GitHub Comments
Thanks for the code snippets!
Just wanted to thank @lintool and @MXueguang for the instructions. I was able to create the FAISS sidecar index (docid + index) and use the sparse, dense, and hybrid retrieval mechanisms. Sharing code here in case it is useful to add to the documentation and/or for others with the same requirement as mine. The first set of code blocks uses a pre-encoded set of queries in the pickled `embedding.pkl` file; the next set encodes the queries on the fly using the model and matches against the sidecar FAISS index.
sparse retrieval (baseline, no change)
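The original snippet was not captured here; a minimal sketch of baseline sparse (BM25) retrieval with the pyserini 0.12 `SimpleSearcher` API, where the index path and query string are placeholders of mine:

```python
from pyserini.search import SimpleSearcher

# Placeholder path: the Lucene index built earlier with pyserini.index.
searcher = SimpleSearcher('/path/to/lucene_index')

# Placeholder query; k controls how many hits are returned.
hits = searcher.search('example query', k=10)
for hit in hits:
    print(hit.docid, hit.score)
```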
dense retrieval with pre-encoded queries
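The original snippet was not captured; a sketch of dense retrieval with pre-encoded queries, assuming pyserini 0.12's `QueryEncoder` looks up query embeddings from the pickled `embedding.pkl` in a directory (the paths and query text are placeholders of mine):

```python
from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder

# Placeholder paths: the FAISS sidecar index (docid + index files) and the
# directory holding the pickled embedding.pkl of pre-encoded queries.
encoder = QueryEncoder('/path/to/encoded_queries')
searcher = SimpleDenseSearcher('/path/to/faiss_index', encoder)

# The query text must match one of the pre-encoded queries in embedding.pkl.
hits = searcher.search('example query', k=10)
for hit in hits:
    print(hit.docid, hit.score)
```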
hybrid retrieval with pre-encoded queries
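The original snippet was not captured; a sketch of hybrid retrieval with pre-encoded queries, assuming pyserini 0.12's `HybridSearcher` combines a sparse and a dense searcher (paths and query are placeholders of mine):

```python
from pyserini.search import SimpleSearcher
from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder
from pyserini.hsearch import HybridSearcher

# Placeholder paths, as in the sparse and dense examples.
ssearcher = SimpleSearcher('/path/to/lucene_index')
encoder = QueryEncoder('/path/to/encoded_queries')
dsearcher = SimpleDenseSearcher('/path/to/faiss_index', encoder)

# HybridSearcher interpolates the dense and sparse scores per document.
hsearcher = HybridSearcher(dsearcher, ssearcher)
hits = hsearcher.search('example query')
for hit in hits:
    print(hit.docid, hit.score)
```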
dense retrieval with custom query encoder, no pre-encoding
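The original snippet was not captured; a sketch of dense retrieval where the query is encoded on the fly with `TctColBertQueryEncoder`, as in the pyserini 0.12 docs. The model checkpoint name is the one commonly used in Pyserini's examples and may differ from the one actually used here; the index path and query are placeholders:

```python
from pyserini.dsearch import SimpleDenseSearcher, TctColBertQueryEncoder

# The encoder embeds the query text at search time; no embedding.pkl needed.
encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')
searcher = SimpleDenseSearcher('/path/to/faiss_index', encoder)

hits = searcher.search('example query', k=10)
for hit in hits:
    print(hit.docid, hit.score)
```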
hybrid retrieval with custom query encoder, no pre-encoding
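The original snippet was not captured; a sketch of hybrid retrieval with on-the-fly query encoding, combining the sparse and dense searchers above under the same assumptions (placeholder paths, placeholder query, docs-style model checkpoint):

```python
from pyserini.search import SimpleSearcher
from pyserini.dsearch import SimpleDenseSearcher, TctColBertQueryEncoder
from pyserini.hsearch import HybridSearcher

ssearcher = SimpleSearcher('/path/to/lucene_index')
encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')
dsearcher = SimpleDenseSearcher('/path/to/faiss_index', encoder)

# Combine dense and sparse results; queries are encoded at search time.
hsearcher = HybridSearcher(dsearcher, ssearcher)
hits = hsearcher.search('example query')
for hit in hits:
    print(hit.docid, hit.score)
```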