Some details regarding generating NQ trainset for the reader model
Hi @AkariAsai. Thank you for this great work.
I’d like to understand more clearly how the NQ trainset for the reader model is generated. In your comment, you said that you removed all table and list elements from NQ’s original preprocessed HTML data. https://github.com/AkariAsai/learning_to_retrieve_reasoning_paths/issues/9#issuecomment-610714692
I’m curious how you handled the case where a list element contains an answer and the list itself is inside a paragraph (like the following example)? https://github.com/google-research-datasets/natural-questions/blob/master/toy_example.md
e.g. <p>Google was founded in 1998 By:<ul><li>Larry</li><li>Sergey</li></ul></p>
Issue Analytics
- Created 3 years ago
- Comments:6 (3 by maintainers)
Thank you for the information!
Addressing your main question: we do not filter out lists inside paragraphs, but we remove any HTML tags remaining in the context during post-processing. Thus, the example you mentioned above would become
Google was founded in 1998 By: Larry Sergey
although there might be some corner cases we missed. In particular, we remove long answer candidates that do not start or end with paragraph tags (i.e., <P> and </P>), and thus purely table- or list-based items are filtered out, but we do not further filter out table or list elements included in paragraphs.
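
The two steps described above can be sketched roughly as follows. This is only an illustration, not the repository's actual preprocessing code: the helper names `is_paragraph_candidate` and `strip_tags` are hypothetical, the regex-based tag removal is an assumed stand-in for whatever the post-processing really uses, and the sketch assumes a candidate is kept only when it both starts with <P> and ends with </P>.

```python
import re

# Assumption: a simple regex is enough to drop tags in this illustration.
TAG_RE = re.compile(r"<[^>]+>")


def is_paragraph_candidate(html: str) -> bool:
    """Keep a long answer candidate only if it is wrapped in paragraph tags."""
    text = html.strip().lower()
    return text.startswith("<p>") and text.endswith("</p>")


def strip_tags(html: str) -> str:
    """Replace every remaining HTML tag with a space, then collapse whitespace."""
    no_tags = TAG_RE.sub(" ", html)
    return " ".join(no_tags.split())


candidates = [
    # A paragraph that contains a list: kept, with its tags removed.
    "<P>Google was founded in 1998 By:<Ul><Li>Larry</Li><Li>Sergey</Li></Ul></P>",
    # A purely list-based candidate: filtered out.
    "<Ul><Li>Larry</Li><Li>Sergey</Li></Ul>",
]

contexts = [strip_tags(c) for c in candidates if is_paragraph_candidate(c)]
print(contexts)
```

Under these assumptions the paragraph candidate survives as plain text with the list items inlined, while the list-only candidate is dropped before tag removal.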