Pre-train on Wikipedia dump: Questions about data
Hello,
Nice paper! 😃 I want to train the bi-encoder as described in section 5.2.2 of your paper and have some questions about the data that you used.
Can you clarify how the subset of the linked mentions is selected?
The relevant passage from the paper: "we pre-train our models on Wikipedia data. We use the May 2019 English Wikipedia dump which includes 5.9M entities, and use the hyperlinks in articles as examples (the anchor text is the mention). We use a subset of all Wikipedia linked mentions as our training data (A total of 9M examples)."
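To illustrate what "the anchor text is the mention" means in practice, here is a minimal sketch of turning hyperlinks into training records and drawing a random subset. The uniform sampling and the field names are purely my assumptions; how the 9M-example subset was actually selected is exactly what I am asking about.

```python
import random

# Illustrative sketch only: build (mention, entity, context) records from Wikipedia
# hyperlinks and draw a random subset. Uniform sampling is an assumption on my part;
# the paper does not say how its 9M-example subset was chosen.
def sample_linked_mentions(hyperlinks, k=9_000_000, seed=0):
    """hyperlinks: iterable of dicts with 'anchor_text' (the mention),
    'target_title' (the linked entity) and 'context' (surrounding text)."""
    records = [
        {"mention": h["anchor_text"], "entity": h["target_title"], "context": h["context"]}
        for h in hyperlinks
    ]
    random.seed(seed)
    return random.sample(records, min(k, len(records)))
```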
What is the format of the input data for training the model?
train_biencoder.py tries to load training data from a train.jsonl file. Can you give a few example rows for such a file?
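In the meantime, judging by the field names the repo's biencoder data-processing code appears to read (context_left / mention / context_right plus the gold entity's title, description and id), I would guess a row looks roughly like the sketch below; please correct me if the schema is different.

```python
import json

# Hypothetical train.jsonl row; the keys and values are my guess at the schema,
# not an official example from the authors.
row = {
    "context_left": "In 1905,",
    "mention": "Einstein",
    "context_right": "published four groundbreaking papers.",
    "label_title": "Albert Einstein",
    "label": "Albert Einstein was a German-born theoretical physicist ...",
    "label_id": 736,
}

with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(row) + "\n")
```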
Is get_processed_data.sh used to process the data?
The name would suggest so, but the README.md of that folder says [deprecated], so I am not sure. (Maybe you could remove the deprecated code from the repository and use a release tag instead for the old code.)
Could you upload the processed training data?
Top GitHub Comments
The training data we used can be downloaded from http://dl.fbaipublicfiles.com/KILT/blink-train-kilt.jsonl and http://dl.fbaipublicfiles.com/KILT/blink-dev-kilt.jsonl. The format of the data is described in https://github.com/facebookresearch/KILT 😃
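If I read the KILT documentation correctly, each line of those files is a JSON object with an "id", an "input" string in which the mention is wrapped in [START_ENT] ... [END_ENT] markers, and an "output" list carrying the gold entity as an "answer" and/or Wikipedia "provenance". A minimal sketch for peeking at the first few rows (my reading of the KILT format, not code from this repo):

```python
import json

# Minimal sketch: inspect the first few records of the KILT-format training file.
# Field access follows my reading of the KILT docs and is not taken from this repo.
with open("blink-train-kilt.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        text = record["input"]          # context with the mention marked by [START_ENT]/[END_ENT]
        gold = record["output"][0]      # first gold annotation
        title = gold.get("answer") or gold.get("provenance", [{}])[0].get("title")
        print(title, "<-", text[:80])
        if i == 4:
            break
```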
@belindal @ledw A lot of people are interested in training BLINK with our data; it would be nice if the authors provided some instructions for training the models, including all the required steps.
Thanks!