Pre-train on Wikipedia dump: Questions about data
Hello,
Nice paper! 😃 I want to train the bi-encoder as described in section 5.2.2 of your paper and have some questions about the data that you used.
Can you clarify how the subset of the linked mentions is selected?
The relevant passage from the paper: "we pre-train our models on Wikipedia data. We use the May 2019 English Wikipedia dump which includes 5.9M entities, and use the hyperlinks in articles as examples (the anchor text is the mention). We use a subset of all Wikipedia linked mentions as our training data (A total of 9M examples)."
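To illustrate what "the anchor text is the mention" means in practice, here is a minimal sketch of turning hyperlinks into training records and drawing a random subset. The uniform sampling and the field names are purely my assumptions; how the 9M-example subset was actually selected is exactly what I am asking about.

```python
import random

# Illustrative sketch only: build (mention, entity, context) records from Wikipedia
# hyperlinks and draw a random subset. Uniform sampling is an assumption on my part;
# the paper does not say how its 9M-example subset was chosen.
def sample_linked_mentions(hyperlinks, k=9_000_000, seed=0):
    """hyperlinks: iterable of dicts with 'anchor_text' (the mention),
    'target_title' (the linked entity) and 'context' (surrounding text)."""
    records = [
        {"mention": h["anchor_text"], "entity": h["target_title"], "context": h["context"]}
        for h in hyperlinks
    ]
    random.seed(seed)
    return random.sample(records, min(k, len(records)))
```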
What is the format of the input data for training the model?
train_biencoder.py tries to load training data from a train.jsonl file. Can you give a few example rows for such a file?
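In the meantime, judging by the field names the repo's biencoder data-processing code appears to read (context_left / mention / context_right plus the gold entity's title, description and id), I would guess a row looks roughly like the sketch below; please correct me if the schema is different.

```python
import json

# Hypothetical train.jsonl row; the keys and values are my guess at the schema,
# not an official example from the authors.
row = {
    "context_left": "In 1905,",
    "mention": "Einstein",
    "context_right": "published four groundbreaking papers.",
    "label_title": "Albert Einstein",
    "label": "Albert Einstein was a German-born theoretical physicist ...",
    "label_id": 736,
}

with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(row) + "\n")
```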
Is get_processed_data.sh used to process the data?
The name would suggest so, but the README.md of that folder says [deprecated], so I am not sure. (Maybe you could remove the deprecated code from the repository and use a release tag instead for the old code.)
Could you upload the processed training data?
Top GitHub Comments
The training data we used can be downloaded from http://dl.fbaipublicfiles.com/KILT/blink-train-kilt.jsonl and http://dl.fbaipublicfiles.com/KILT/blink-dev-kilt.jsonl. The format of the data is described in https://github.com/facebookresearch/KILT 😃
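If I read the KILT documentation correctly, each line of those files is a JSON object with an "id", an "input" string in which the mention is wrapped in [START_ENT] ... [END_ENT] markers, and an "output" list carrying the gold entity as an "answer" and/or Wikipedia "provenance". A minimal sketch for peeking at the first few rows (my reading of the KILT format, not code from this repo):

```python
import json

# Minimal sketch: inspect the first few records of the KILT-format training file.
# Field access follows my reading of the KILT docs and is not taken from this repo.
with open("blink-train-kilt.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        text = record["input"]          # context with the mention marked by [START_ENT]/[END_ENT]
        gold = record["output"][0]      # first gold annotation
        title = gold.get("answer") or gold.get("provenance", [{}])[0].get("title")
        print(title, "<-", text[:80])
        if i == 4:
            break
```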
@belindal @ledw A lot of people are interested in training BLINK with our data; it would be nice if the authors provided some instructions for training the models, including all the required steps.
Thanks!