Feature Request: Question Answering dataset
Describe the problem
Given the relative scarcity of tools for building a question-answering dataset, it would be great if doccano could serve that purpose. At its core, that would mean each document is a passage, with forms below it (as in a Translation project) that let you enter questions. You would then need the ability to annotate stretches of text as the answers (similar to a Sequence Labeling project). The trickiest part would likely be tying the answer annotations back to the questions they answer.
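As a rough sketch of the data model this implies (the field names below are illustrative only, not doccano's actual schema), each annotated passage might export to something like this, with every answer stored as character offsets tied to a specific question:

```python
# Hypothetical export record for one annotated passage.
# Field names are illustrative, not doccano's real export format.
record = {
    "text": "Doccano is an open source text annotation tool for humans.",
    "questions": [
        {
            "id": 1,
            "question": "What kind of tool is doccano?",
            # The answer is a span over `text`, identified by character
            # offsets and explicitly linked back to this question.
            "answers": [{"start_offset": 11, "end_offset": 46}],
        },
    ],
}

# The answer text can always be recovered from the offsets:
for ans in record["questions"][0]["answers"]:
    print(record["text"][ans["start_offset"]:ans["end_offset"]])
    # -> "an open source text annotation tool"
```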
A naive first pass could be to combine the SequenceLabeling and MachineTranslation UIs. The label for each answer could simply be answer-to-Q1, answer-to-Q2, and so on, with each document capped at a maximum number of questions (a sketch of how those labels could be tied back to the questions follows below).
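To illustrate how that workaround could be stitched back together after export (the structures below are assumed, not doccano's real export format), the answer-to-Qn labels from the sequence-labeling side could be matched to the question entered in the n-th translation-style form:

```python
import re

# Assumed post-export structures for one document: `spans` from a
# SequenceLabeling-style project and `questions` from the Translation-style
# forms. Neither shape is doccano's actual export format.
text = "Paris is the capital of France."
questions = ["What is the capital of France?", "Which country is Paris in?"]
spans = [
    {"start": 0, "end": 5, "label": "answer-to-Q1"},
    {"start": 24, "end": 30, "label": "answer-to-Q2"},
]

def pair_answers(text, questions, spans):
    """Tie each answer-to-Qn span back to the n-th question."""
    pairs = []
    for span in spans:
        match = re.fullmatch(r"answer-to-Q(\d+)", span["label"])
        if not match:
            continue  # skip labels outside this naming convention
        q_index = int(match.group(1)) - 1
        if q_index < len(questions):
            pairs.append({
                "question": questions[q_index],
                "answer": text[span["start"]:span["end"]],
            })
    return pairs

print(pair_answers(text, questions, spans))
# [{'question': 'What is the capital of France?', 'answer': 'Paris'},
#  {'question': 'Which country is Paris in?', 'answer': 'France'}]
```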
Does this sound doable?
Issue Analytics
- Created: 5 years ago
- Reactions: 29
- Comments: 13 (7 by maintainers)
Top GitHub Comments
This feature is really important and urgent. I look forward to contributing to it and will keep following its progress.
Note: the need for a tool like this is greater now that high-quality datasets like SQuAD exist, because it’s more practical to train your own models. Given that you can fine-tune BERT on SQuAD, it would be invaluable to have a tool that lets you build your own domain-specific, SQuAD-formatted dataset!
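To make that concrete, here is a minimal sketch of writing passage/question/answer-span annotations out as SQuAD v1.1-style JSON. The input structure is an assumed in-house format, not any actual doccano export:

```python
import json
import uuid

def to_squad(passages):
    """Convert simple (context, question, answer-span) annotations into
    SQuAD v1.1-style JSON.

    `passages` is an assumed structure:
    [{"context": str, "qas": [{"question": str, "start": int, "end": int}]}]
    """
    data = {"version": "1.1", "data": [{"title": "custom", "paragraphs": []}]}
    for passage in passages:
        qas = []
        for qa in passage["qas"]:
            answer_text = passage["context"][qa["start"]:qa["end"]]
            qas.append({
                "id": uuid.uuid4().hex,
                "question": qa["question"],
                "answers": [{"text": answer_text, "answer_start": qa["start"]}],
            })
        data["data"][0]["paragraphs"].append(
            {"context": passage["context"], "qas": qas}
        )
    return data

passages = [{
    "context": "Paris is the capital of France.",
    "qas": [{"question": "What is the capital of France?", "start": 0, "end": 5}],
}]
print(json.dumps(to_squad(passages), indent=2))
```

A file in this shape can be fed straight into the standard SQuAD fine-tuning scripts for BERT-style models.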