Term occurs in document vector, but has collection frequency 0
I’ve found a term that occurs once in a document vector, but doesn’t occur in the collection. Am I using the wrong analyzer, or is this a bug? I’ve used the following Pyserini functions:
index_utils = pyutils.IndexReaderUtils('/Index/lucene-index.core18.pos+docvectors+rawdocs_all')
tf = index_utils.get_document_vector(docid)
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
df = {term: (index_utils.get_term_counts(term, analyzer=analyzer))[1] for term in tf.keys()}
output:
tf = {.. 'hobbies:photographi': 1, ..}
df = {.. 'hobbies:photographi': 0, ..}
I assume the term is derived from this part in the raw text: “…<b>HOBBIES:</b>Photography…”
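To check whether other terms are affected in the same way, a small helper along these lines could list every term that appears in the document vector but comes back with a count of 0. This is only a sketch: `zero_count_terms` is a hypothetical name, and the sample dictionaries are made-up values standing in for the `tf` and `df` dictionaries built above.

```python
def zero_count_terms(tf, counts):
    """Return terms present in the document vector whose collection-level count is 0."""
    return sorted(
        term for term, freq in tf.items()
        if freq > 0 and counts.get(term, 0) == 0
    )

# Made-up sample values, for illustration only (not taken from the real index).
sample_tf = {"hobbies:photographi": 1, "photographi": 2}
sample_counts = {"hobbies:photographi": 0, "photographi": 57}

suspects = zero_count_terms(sample_tf, sample_counts)
print(suspects)
```

If every suspect term contains a colon, that points at how the term string is being interpreted during lookup rather than at the analyzer.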
Issue Analytics
- State:
- Created 3 years ago
- Comments: 8 (6 by maintainers)
Top GitHub Comments
@PepijnBoers please take a look: https://github.com/castorini/anserini/pull/1135
+1 with it if you’re happy.
Yup, you’re right, there’s a bug here.
What’s happening is that a:b is getting interpreted by Lucene as a field query, i.e., one where “a” is the field name. This is because we run the query through a query parser: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexReaderUtils.java#L210
We shouldn’t.
Although this does the right thing:
This requires a patch to Anserini and then a new Maven artifact deploy. I’ll get on it.
Thanks for catching the bug!
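For anyone reading along, the field-query misinterpretation described above can be illustrated with a toy model. This is not Lucene’s or Anserini’s actual code: `parse_as_query`, `collection_frequency`, the dictionary-based index, and the default field name "contents" are all assumptions made for the sketch.

```python
# Toy model only: NOT Lucene's query parser, just an illustration of why
# a string like "a:b" can end up being read as field "a", term "b".

def parse_as_query(text, default_field="contents"):
    """Mimic field-query syntax: 'field:term' selects a field explicitly."""
    if ":" in text:
        field, term = text.split(":", 1)
        return field, term
    return default_field, text

def collection_frequency(index, field, term):
    """Look up a term's count in a {field: {term: count}} toy index."""
    return index.get(field, {}).get(term, 0)

# Hypothetical index: the literal token lives in the assumed default field.
toy_index = {"contents": {"hobbies:photographi": 1}}
token = "hobbies:photographi"

# Via the parser: the lookup targets a non-existent field named 'hobbies'.
parsed_field, parsed_term = parse_as_query(token)
cf_via_parser = collection_frequency(toy_index, parsed_field, parsed_term)

# Treating the whole token as a literal term in the default field.
cf_direct = collection_frequency(toy_index, "contents", token)

print(cf_via_parser, cf_direct)
```

In this model the parsed lookup misses the term while the literal lookup finds it, which matches the symptom in the report: the term is in the document vector, but the count obtained through the parser is 0.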