Some details regarding generating NQ trainset for the reader model
Hi @AkariAsai. Thank you for this great work.
I’d like to understand more clearly how the NQ trainset for the reader model is generated. In your comment, you said that you removed all the table and list elements from NQ’s original preprocessed HTML data. https://github.com/AkariAsai/learning_to_retrieve_reasoning_paths/issues/9#issuecomment-610714692
I’m curious how you handled the case where a list element contains an answer and a paragraph contains the list? (like the following example) https://github.com/google-research-datasets/natural-questions/blob/master/toy_example.md
<p>Google was founded in 1998 By:<ul><li>Larry</li><li>Sergey</li></ul></p>
- Created 3 years ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
Thank you for the information!
To address your main question: we do not filter out lists inside paragraphs, but we remove any HTML tags remaining in the context during post-processing. Thus, the example you mentioned above would become
Google was founded in 1998 By: Larry Sergey
although there might be some corner cases we missed.
In particular, we remove long answer candidates that do not start or end with paragraph tags (i.e., <P> and </P>), so purely table- or list-based items are filtered out, but we do not further filter out table or list elements included in paragraphs.
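As a rough illustration of the two steps described above (this is not the authors' actual preprocessing code), a minimal Python sketch might look like the following. The function names are hypothetical, and it assumes a candidate is kept only when it both starts with an opening paragraph tag and ends with a closing one, which is one reading of the description.

```python
import re

# Matches any HTML tag such as <P>, </Li>, or <Ul>.
TAG_RE = re.compile(r"<[^>]+>")


def is_paragraph_candidate(html: str) -> bool:
    """Return True if the candidate starts and ends with paragraph tags.

    Assumption: both conditions are required; the comparison is
    case-insensitive so <P> and <p> are treated alike.
    """
    text = html.strip().lower()
    return text.startswith("<p>") and text.endswith("</p>")


def strip_tags(html: str) -> str:
    """Remove remaining HTML tags and collapse whitespace.

    Each tag is replaced by a space so that adjacent list items
    do not fuse into a single token.
    """
    return " ".join(TAG_RE.sub(" ", html).split())


def build_contexts(candidates):
    """Keep paragraph-wrapped candidates and return them as plain text."""
    return [strip_tags(c) for c in candidates if is_paragraph_candidate(c)]


if __name__ == "__main__":
    candidates = [
        "<p>Google was founded in 1998 By:<ul><li>Larry</li><li>Sergey</li></ul></p>",
        "<Table><Tr><Td>1998</Td></Tr></Table>",
        "<Ul><Li>Larry</Li><Li>Sergey</Li></Ul>",
    ]
    for context in build_contexts(candidates):
        print(context)
```

Under these assumptions, the list nested inside the paragraph survives as plain text, while the standalone table and list candidates are dropped.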