
Ranking results using weighted documents are lower than in the paper

See original GitHub issue

Hi, thanks for sharing the data and source code!

I tried to reproduce the results using the shared Virtual Appendix/weighted_documents (http://boston.lti.cs.cmu.edu/appendices/arXiv2019-DeepCT-Zhuyun-Dai/weighted_documents/). Here are the steps I followed, based on the Anserini MS MARCO passage experiments (https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md):

  1. Unzip sqrt_sample_100_jsonl.zip.
  2. Build the Anserini index: sh ../anserini/target/appassembler/bin/IndexCollection -collection JsonCollection -input sqrt_sample_100_jsonl -index lucene-index.msmarco.deepct -generator LuceneDocumentGenerator -threads 10 -storePositions -storeRawDocs > log.msmarco.deepct
  3. Search the index: ../anserini/target/appassembler/bin/SearchMsmarco -hits 1000 -threads 10 -index lucene-index.msmarco.deepct -qid_queries msmarco/queries.dev.small.tsv -output output/run.dev.small.tsv
  4. Evaluate the run (MRR@10; see the sketch after this list): python ../anserini/src/main/python/msmarco/msmarco_eval.py msmarco/qrels.dev.small.tsv output/run.dev.small.tsv
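
For reference, here is a minimal sketch of the MRR@10 metric that msmarco_eval.py reports, written against the tab-separated qid/docid/rank run format from the Anserini MS MARCO guide. It illustrates the metric only; it is not the evaluation script itself:

    # Minimal MRR@10: mean over queries of 1/rank of the first relevant
    # passage within the top 10 results; a query with no relevant hit scores 0.
    def mrr_at_10(qrels_path, run_path):
        relevant = {}  # qid -> set of relevant docids
        with open(qrels_path) as f:
            for line in f:
                qid, _, docid, _ = line.split()
                relevant.setdefault(qid, set()).add(docid)

        best_rank = {}  # qid -> highest-ranked relevant hit (rank <= 10)
        with open(run_path) as f:
            for line in f:
                qid, docid, rank = line.split()
                if int(rank) <= 10 and docid in relevant.get(qid, ()):
                    best_rank[qid] = min(int(rank), best_rank.get(qid, 10))

        # Average over all queries that have judgments.
        return sum(1.0 / r for r in best_rank.values()) / len(relevant)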

However, I could only get about 0.22 on sample_100_jsonl.zip.

#####################
MRR @10: 0.22490426160913188
QueriesRanked: 6980
#####################

The result on sqrt_sample_100_jsonl.zip is about 0.20.

#####################
MRR @10: 0.20240204438986623
QueriesRanked: 6980
#####################

But the paper reports about 0.24 on the dev set. Is there anything wrong with my processing?

Thanks!

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments:5 (2 by maintainers)

Top GitHub Comments

2 reactions
AdeDZY commented, Dec 23, 2019

It is critical to fine-tune the k1 and b parameters of BM25. The optimal k1 should be around 9-13, and b around 0.7-0.9.
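
For illustration, a minimal sketch of such a parameter sweep. It assumes the retrieve.py flags (-k, -b) quoted later in this thread and parses the "MRR @10: ..." line that msmarco_eval.py prints; the paths, flag names, and output parsing are assumptions, not a confirmed interface:

    # Hypothetical grid search over BM25 parameters; flag names follow the
    # retrieve.py invocation quoted later in this thread and may differ by version.
    import itertools, re, subprocess

    best = (0.0, None)
    for k1, b in itertools.product([9.0, 11.0, 13.0], [0.7, 0.8, 0.9]):
        run = f"output/run.k1_{k1}.b_{b}.tsv"
        subprocess.run(["python", "anserini/src/main/python/msmarco/retrieve.py",
                        "-index", "lucene-index.msmarco.deepct",
                        "-qid_queries", "msmarco/queries.dev.small.tsv",
                        "-output", run, "-k", str(k1), "-b", str(b)], check=True)
        out = subprocess.run(["python", "anserini/src/main/python/msmarco/msmarco_eval.py",
                              "msmarco/qrels.dev.small.tsv", run],
                             capture_output=True, text=True, check=True).stdout
        mrr = float(re.search(r"MRR @10:\s*([0-9.]+)", out).group(1))
        print(f"k1={k1} b={b} MRR@10={mrr:.4f}")
        if mrr > best[0]:
            best = (mrr, (k1, b))
    print("best:", best)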


0 reactions
LAW991224 commented, May 1, 2020

Thank you for your kind explanation! Best wishes

On 2020-04-29 23:25:23, Dai Zhuyun (戴竹韵) notifications@github.com wrote:

Hi, thanks for checking out my work! You should tune the parameters when running the retrieval, i.e., python anserini/src/main/python/msmarco/retrieve.py -k 8.0 -b 0.9 …

Best, Zhuyun

On Wed, Apr 29, 2020 at 4:03 AM LAW991224 notifications@github.com wrote:

Hi, thank you for sharing the data and source code! I wonder whether there is a way to fine-tune k1 and b without rebuilding the index (using the Anserini method)? At the moment, every time I want to change the values of k1 and b, I have to delete the existing index and build another one (because the values of k1 and b are specified when building the index), which is time-consuming.

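On the question of re-tuning without rebuilding: Lucene applies BM25 scoring at search time, so k1 and b are query-time settings and a single index suffices, which is what the reply above suggests. As a hedged sketch, assuming a Pyserini-style API (the LuceneSearcher class and its set_bm25 method are an assumption about that library, not something confirmed in this thread):

    # Search-time BM25 tuning over one prebuilt index; no re-indexing needed.
    # Assumes Pyserini's LuceneSearcher API; names may differ by version.
    from pyserini.search.lucene import LuceneSearcher

    searcher = LuceneSearcher("lucene-index.msmarco.deepct")
    for k1, b in [(9.0, 0.7), (11.0, 0.8), (13.0, 0.9)]:
        searcher.set_bm25(k1=k1, b=b)  # takes effect immediately, per query
        hits = searcher.search("what is the definition of bm25", k=10)
        print(f"k1={k1}, b={b}: top hit {hits[0].docid} ({hits[0].score:.3f})")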
