Term occurs in document vector, but has collection frequency 0


I’ve found a term that occurs once in a document vector, but doesn’t occur in the collection. Am I using the wrong analyzer or is this a bug? I’ve used the following Pyserini functions:

from pyserini.index import pyutils
from pyserini.analysis import pyanalysis

index_utils = pyutils.IndexReaderUtils('/Index/lucene-index.core18.pos+docvectors+rawdocs_all')
tf = index_utils.get_document_vector(docid)
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
df = {term: index_utils.get_term_counts(term, analyzer=analyzer)[1] for term in tf.keys()}

output:

tf = {.. 'hobbies:photographi': 1, ..}
df = {.. 'hobbies:photographi': 0, ..}

I assume the term is derived from this part in the raw text: “…<b>HOBBIES:</b>Photography…”

Created 3 years ago. Comments: 8 (6 by maintainers).

Top GitHub Comments

lintool commented, May 2, 2020

@PepijnBoers please take a look: https://github.com/castorini/anserini/pull/1135

+1 with it if you’re happy.

lintool commented, May 2, 2020

Yup, you’re right, there’s a bug here.

from pyserini.analysis.pyanalysis import get_lucene_analyzer, Analyzer
analyzer = get_lucene_analyzer(stemming=False, stopwords=False)

index_utils.get_term_counts('hobbies:photographi', analyzer)
# fails: (0, 0)

# Raw strings keep the backslash literal and avoid Python's invalid-escape warning for '\:'.
index_utils.get_term_counts(r'hobbies\:photographi', analyzer)
# works: (1, 1)

index_utils.get_term_counts('hobbies:photography')
# fails: (0, 0)

index_utils.get_term_counts(r'hobbies\:photography')
# works: (1, 1)

What’s happening is that Lucene’s query parser interprets a:b as a field query, i.e., “a” is taken as the field name and “b” as the term to look up in that field.

This is because we run the query through a query parser: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexReaderUtils.java#L210

We shouldn’t.
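
Until a fix is released, one possible workaround is to escape the query-syntax characters before looking up a term, which is what the hand-escaped calls above do. The sketch below is only an illustration: `escape_lucene_term` is a hypothetical helper, not part of Pyserini or Anserini, and its character list is assumed to match the set escaped by Lucene's classic `QueryParser.escape`.

```python
# Characters the classic Lucene query parser treats as syntax.
# Assumption: this mirrors the set handled by Lucene's QueryParser.escape.
LUCENE_SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&/')


def escape_lucene_term(term: str) -> str:
    """Prefix every query-syntax character in `term` with a backslash.

    Hypothetical helper for illustration; not a Pyserini function.
    """
    return ''.join('\\' + ch if ch in LUCENE_SPECIAL_CHARS else ch for ch in term)


escaped = escape_lucene_term('hobbies:photographi')
print(escaped)  # hobbies\:photographi
```

The escaped string can then be passed to `index_utils.get_term_counts(escaped, analyzer)`, the call that returned (1, 1) above.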

Although this does the right thing:

postings_list = index_utils.get_postings_list('hobbies:photography')
for posting in postings_list:
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')
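
Since the postings list comes back correctly, the two counts can in principle be derived from it: document frequency is the number of postings, and collection frequency is the sum of their `tf` values. The following is a minimal sketch of that arithmetic. `Posting` is a stand-in namedtuple carrying only the attributes printed above, and `counts_from_postings` is a hypothetical helper, not a Pyserini function.

```python
from collections import namedtuple

# Stand-in for the posting objects; only the attributes used in the loop above.
Posting = namedtuple('Posting', ['docid', 'tf', 'positions'])


def counts_from_postings(postings):
    """Return (document frequency, collection frequency) for one term."""
    postings = list(postings)
    df = len(postings)                 # number of documents containing the term
    cf = sum(p.tf for p in postings)   # total occurrences across those documents
    return df, cf


example = [Posting(docid=7, tf=1, positions=[42])]
print(counts_from_postings(example))  # (1, 1)
```

With a real index, the list returned by `get_postings_list` would take the place of `example`.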

This requires a patch to Anserini and then a new Maven artifact deploy. I’ll get on it.

Thanks for catching the bug!