Anchor Text for msmarco-document and msmarco-document-v2
Dataset Information:
We have extracted anchor text pointing to documents in MS MARCO (version 1 and version 2) from several Common Crawl snapshots. The anchor text can be used as additional retrieval features or for training models (e.g., in a distant supervision style like DeepCT).
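As a rough illustration of how such anchor text could be turned into a per-document retrieval feature, here is a minimal sketch. It assumes the extracted links are available as (target document ID, anchor text) pairs; the record type `AnchorTextDoc`, the function `group_anchor_texts`, the field names, and the example data are all hypothetical and are not taken from the actual dataset or from ir_datasets.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class AnchorTextDoc(NamedTuple):
    """Hypothetical record: one target document with its incoming anchor texts."""
    doc_id: str
    text: str           # all anchor texts joined into one string
    anchors: List[str]  # the individual anchor texts


def group_anchor_texts(links: Iterable[Tuple[str, str]]) -> Dict[str, AnchorTextDoc]:
    """Group (target_doc_id, anchor_text) pairs by target document."""
    grouped = defaultdict(list)
    for doc_id, anchor in links:
        anchor = anchor.strip()
        if anchor:  # skip anchors that are empty after stripping whitespace
            grouped[doc_id].append(anchor)
    return {
        doc_id: AnchorTextDoc(doc_id, " ".join(anchors), anchors)
        for doc_id, anchors in grouped.items()
    }


# Made-up example links; the document IDs are placeholders.
example_links = [
    ("D1", "MS MARCO homepage"),
    ("D1", "click here"),
    ("D2", " "),
    ("D2", "anchor text paper"),
]
docs = group_anchor_texts(example_links)
```

The concatenated `text` could be indexed as an extra field next to the document body, while the `anchors` list keeps the individual link texts available for finer-grained features.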
Links to Resources:
Dataset ID(s) & supported entities:
- Dataset ID: msmarco-document/anchor-text and msmarco-document-v2/anchor-text
Checklist
Mark each task once completed. All should be checked prior to merging a new dataset.
- Dataset definition (in ir_datasets/datasets/[topid].py)
- Tests (in tests/integration/[topid].py)
- Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
- Documentation (in ir_datasets/etc/[topid].yaml)
  - Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
- Downloadable content (in ir_datasets/etc/downloads.json)
  - Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
  - Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.
Additional comments/concerns/ideas/etc.
I would be happy to help integrate the anchor texts into ir_datasets. I am not sure what a good Dataset ID would be: it can make sense to integrate them as a subset of the existing msmarco-document and msmarco-document-v2 IDs, but it might also make sense to give them independent IDs.
Issue Analytics
- State:
- Created 2 years ago
- Comments: 7 (3 by maintainers)
Top GitHub Comments
Excellent, thanks again @mam10eks!
@mam10eks – can you accept the PR here with the metadata when you get a chance? https://github.com/mam10eks/ir_datasets/pull/1