Anchor Text for msmarco-document and msmarco-document-v2

Dataset Information:

We have extracted anchor text pointing to MS MARCO documents (version 1 and version 2) from several Common Crawl snapshots. The anchor text can be used as an additional retrieval feature or for training models (e.g., in a distant-supervision style like DeepCT).
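
To make the idea concrete, here is a minimal sketch of how anchor texts could be grouped per target document so they can serve as an extra retrieval field. It does not use the ir_datasets API or the actual extraction pipeline: the `AnchorTextDoc` record, the `group_anchor_texts` helper, and the example links are all hypothetical and only illustrate the shape of the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class AnchorTextDoc(NamedTuple):
    """Hypothetical record: one target document plus the anchor texts pointing to it."""
    doc_id: str
    text: str           # all anchor texts joined into one string
    anchors: List[str]  # the individual anchor texts


def group_anchor_texts(links: Iterable[Tuple[str, str]]) -> Dict[str, AnchorTextDoc]:
    """Group (target_doc_id, anchor_text) pairs by target document."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for doc_id, anchor in links:
        anchor = anchor.strip()
        if anchor:  # skip anchors that are empty after trimming whitespace
            grouped[doc_id].append(anchor)
    return {
        doc_id: AnchorTextDoc(doc_id, " ".join(anchors), anchors)
        for doc_id, anchors in grouped.items()
    }


# Made-up example links; the IDs only imitate the MS MARCO naming style.
links = [
    ("D100", "python tutorial"),
    ("D100", "learn python"),
    ("D200", "  "),
    ("D200", "weather forecast"),
]
docs = group_anchor_texts(links)
print(docs["D100"].text)
```

The joined `text` could be indexed as an additional field next to the document body, while the `anchors` list keeps the individual anchor texts available for feature extraction or for building distant-supervision training signals.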

Links to Resources:

Dataset ID(s) & supported entities:

Dataset IDs: msmarco-document/anchor-text and msmarco-document-v2/anchor-text

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

Dataset definition (in ir_datasets/datasets/[topid].py)
Tests (in tests/integration/[topid].py)
Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
Documentation (in ir_datasets/etc/[topid].yaml)
- Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
Downloadable content (in ir_datasets/etc/downloads.json)
- Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
- Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

I would be happy to help integrate the anchor texts into ir_datasets. I am not sure what a good dataset ID would be: it could make sense to integrate them as a subset of the existing msmarco-document and msmarco-document-v2 IDs, but it might also make sense to give them independent IDs.

Top GitHub Comments

seanmacavaney commented, Jan 20, 2022

Excellent, thanks again @mam10eks!

seanmacavaney commented, Jan 20, 2022

@mam10eks – can you accept the PR here with the metadata when you get a chance? https://github.com/mam10eks/ir_datasets/pull/1
