Term occurs in document vector, but has collection frequency 0

I’ve found a term that occurs once in a document vector, but doesn’t occur in the collection. Am I using the wrong analyzer or is this a bug? I’ve used the following Pyserini functions:

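from pyserini.index import pyutils
from pyserini.analysis import pyanalysis

index_utils = pyutils.IndexReaderUtils('/Index/lucene-index.core18.pos+docvectors+rawdocs_all')
tf = index_utils.get_document_vector(docid)
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
# get_term_counts returns (document frequency, collection frequency); [1] picks the collection frequency
df = {term: (index_utils.get_term_counts(term, analyzer=analyzer))[1] for term in tf.keys()}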

Output:

tf = {.. 'hobbies:photographi': 1, ..}
df = {.. 'hobbies:photographi': 0, ..}

I assume the term is derived from this part in the raw text: “…<b>HOBBIES:</b>Photography…”

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

lintool commented, May 2, 2020

@PepijnBoers please take a look: https://github.com/castorini/anserini/pull/1135

+1 it if you’re happy with it.

lintool commented, May 2, 2020

Yup, you’re right, there’s a bug here.

from pyserini.analysis.pyanalysis import get_lucene_analyzer
analyzer = get_lucene_analyzer(stemming=False, stopwords=False)

# Raw strings (r'...') keep the backslash literal so it reaches the Lucene query parser.
index_utils.get_term_counts('hobbies:photographi', analyzer)
# fails: (0, 0)

index_utils.get_term_counts(r'hobbies\:photographi', analyzer)
# works: (1, 1)

index_utils.get_term_counts('hobbies:photography')
# fails: (0, 0)

index_utils.get_term_counts(r'hobbies\:photography')
# works: (1, 1)

What’s happening is that a:b is getting interpreted by Lucene as a field query, i.e., where “a” is the field name.

This is because we run the query through a query parser: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexReaderUtils.java#L210

We shouldn’t.

Although this does the right thing:

postings_list = index_utils.get_postings_list('hobbies:photography')
for posting in postings_list:
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')

This requires a patch to Anserini and then a new Maven artifact deploy. I’ll get on it.

Thanks for catching the bug!
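
Until a fixed release is available, the escaping workaround from the comment above could be wrapped in a small helper. This is only a sketch: `escape_lucene_term` and `LUCENE_SPECIAL_CHARS` are hypothetical names, not part of Pyserini or Anserini, and the character set is assumed to match the one that Lucene's classic `QueryParser.escape()` handles.

```python
# Characters that Lucene's classic query parser treats as syntax.
# Assumption: this mirrors the set escaped by Lucene's QueryParser.escape().
LUCENE_SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&/')


def escape_lucene_term(term: str) -> str:
    """Prefix every query-syntax character in `term` with a backslash.

    Hypothetical helper for illustration; not a Pyserini API.
    """
    return ''.join('\\' + ch if ch in LUCENE_SPECIAL_CHARS else ch for ch in term)


escaped = escape_lucene_term('hobbies:photographi')
print(escaped)  # hobbies\:photographi
```

The escaped string can then be passed to `index_utils.get_term_counts(escaped, analyzer)` in place of the raw term, so the colon is no longer read as a field separator. Once the query parser is bypassed in Anserini, the helper should no longer be needed.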
