
Pre-train on Wikipedia dump: Questions about data

See original GitHub issue

Hello,

Nice paper! 😃 I want to train the bi-encoder as described in section 5.2.2 of your paper and have some questions about the data that you used.

Can you clarify how the subset of the linked mentions is selected?

we pre-train our models on Wikipedia data. We use the May 2019 English Wikipedia dump which includes 5.9M entities, and use the hyperlinks in articles as examples (the anchor text is the mention). We use a subset of all Wikipedia linked mentions as our training data (A total of 9M examples).
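
For illustration, here is a rough sketch of how I imagine the (mention, entity) pairs could be pulled out of the hyperlinks in a wikitext dump. This is just my own approximation of the setup described above, not the authors' actual pipeline, and the regex only handles plain [[Target|anchor]] links:

```python
import re

# Wikipedia links look like [[Target page]] or [[Target page|anchor text]];
# the anchor text is the mention, the target page is the linked entity.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")

def extract_linked_mentions(wikitext):
    """Yield (mention, entity_title) pairs from one article's wikitext."""
    for match in LINK_RE.finditer(wikitext):
        entity = match.group(1).strip()
        mention = (match.group(2) or entity).strip()
        if mention and entity:
            yield mention, entity

text = "In 1997 [[Deep Blue (chess computer)|Deep Blue]] defeated [[Garry Kasparov]]."
print(list(extract_linked_mentions(text)))
# [('Deep Blue', 'Deep Blue (chess computer)'), ('Garry Kasparov', 'Garry Kasparov')]
```

A real pipeline would presumably also strip templates and markup, resolve redirects, and subsample down to the ~9M training mentions, which I am glossing over here.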

What is the format of the input data for training the model? train_biencoder.py tries to load training data from a train.jsonl file. Can you give a few example rows for such a file?
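
For context, here is my best guess at what one row of train.jsonl might look like. The field names (context_left, mention, context_right, label, label_title, label_id) are assumptions on my part, based on what the bi-encoder data reader seems to expect, so please correct me if the real schema differs:

```python
import json

# Hypothetical training row (field names are my guess, not confirmed):
# the mention is split into left/right context, and the label is the
# description + title of the Wikipedia entity the hyperlink points to.
example = {
    "context_left": "In 1997, the chess computer",
    "mention": "Deep Blue",
    "context_right": "defeated the reigning world champion.",
    "label": "Deep Blue was a chess-playing computer developed by IBM.",
    "label_title": "Deep Blue (chess computer)",
    "label_id": 123456,
}

# train.jsonl would then simply hold one such JSON object per line.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```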

Is get_processed_data.sh used to process the data? The name would suggest so, lol. However, the README.md of that folder says [deprecated], so I am not sure. (Maybe you could remove the deprecated code from the repository and use a release tag for the old code instead.)

Could you upload the processed training data?

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 9
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

7 reactions
fabiopetroni commented, Dec 14, 2020
6 reactions
bushjavier commented, Oct 7, 2020

@belindal @ledw a lot of people are interested in training BLINK with our data; it would be nice if the authors could provide some instructions on how to train the models, including all the steps required.

Thanks!

Read more comments on GitHub >

Top Results From Across the Web

Reading Wikipedia to Answer Open-Domain Questions
This repository includes code, data, and pre-trained models for processing and querying Wikipedia as described in the paper -- see Trained Models and...

Pre-processing a Wikipedia dump for NLP model training
Wikipedia dumps are used frequently in modern NLP research for model training, especially with transformers like BERT, RoBERTa, XLNet, XLM, ...

Index of /enwiki/
Directory listing of English Wikipedia dump snapshots by date (20220920/, 20221001/, 20221020/, 20221101/, 20221120/, ...).

Wikipedia2Vec
Wikipedia2Vec is a tool used for obtaining embeddings (vector representations) of words and entities from Wikipedia. The embeddings can be used as word ...

Training Bangla LM from wikipedia data - Fast.ai forums
Hi, I am trying to train a Bangla language model using wikipedia articles. I used this script to download from wiki data dumps...
