Anchor Text for msmarco-document and msmarco-document-v2

Dataset Information:

We have extracted anchor text pointing to documents in MS MARCO (version 1 and version 2) from several Common Crawl snapshots. The anchor text can be used as additional retrieval features or for training models (e.g., in a distant supervision style like DeepCT).
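
A minimal sketch of how such anchor text could be turned into a per-document retrieval feature, assuming the extracted links are available as (target document ID, anchor text) pairs. The `AnchorDoc` record, the `group_anchor_text` helper, and the toy data are hypothetical illustrations and not part of ir_datasets or the released dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class AnchorDoc(NamedTuple):
    """Hypothetical record: all anchor texts that point to one document."""
    doc_id: str
    text: str           # anchor texts joined into one string
    anchors: List[str]  # the individual anchor texts


def group_anchor_text(links: Iterable[Tuple[str, str]]) -> Dict[str, AnchorDoc]:
    """Group (target_doc_id, anchor_text) pairs by target document."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for doc_id, anchor in links:
        anchor = anchor.strip()
        if anchor:  # skip empty anchors, e.g. image-only links
            grouped[doc_id].append(anchor)
    return {
        doc_id: AnchorDoc(doc_id, " ".join(anchors), anchors)
        for doc_id, anchors in grouped.items()
    }


# Toy input; the document IDs and anchor strings are made up.
links = [
    ("D1", "python tutorial"),
    ("D2", "ms marco"),
    ("D1", " learn python "),
    ("D1", ""),
]
anchor_docs = group_anchor_text(links)
```

The concatenated `text` could be indexed as an extra field next to the document body, while the `anchors` list keeps the individual strings for models that treat each anchor separately.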

Links to Resources:

Dataset ID(s) & supported entities:

  • Dataset IDs: msmarco-document/anchor-text and msmarco-document-v2/anchor-text

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

  • Dataset definition (in ir_datasets/datasets/[topid].py)
  • Tests (in tests/integration/[topid].py)
  • Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
  • Documentation (in ir_datasets/etc/[topid].yaml)
  • Downloadable content (in ir_datasets/etc/downloads.json)
    • Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
    • Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

I would be happy to help integrate the anchor texts into ir_datasets. I am not sure what a good dataset ID would be: it could make sense to integrate them as subsets of the existing msmarco-document and msmarco-document-v2 IDs, but it might also make sense to give them independent IDs.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
seanmacavaney commented, Jan 20, 2022

Excellent, thanks again @mam10eks!

0 reactions
seanmacavaney commented, Jan 20, 2022

@mam10eks – can you accept the PR here with the metadata when you get a chance? https://github.com/mam10eks/ir_datasets/pull/1
