Feature Request: Question Answering dataset
Describe the problem
Given the relative scarcity of tools for building a question-answering dataset, it would be great if doccano could serve that purpose. At its core, that would mean each document is a passage, with forms below it (as in a Translation project) that let you enter questions. You would then need the ability to annotate stretches of text as the answers (similar to a Sequence Labeling project). The trickiest part would likely be tying the answer annotations back to the questions they answer.
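As a rough sketch of the data model this implies (the field names below are illustrative only, not doccano's actual schema), each annotated passage might export to something like this, with every answer stored as character offsets tied to a specific question:

```python
# Hypothetical export record for one annotated passage.
# Field names are illustrative, not doccano's real export format.
record = {
    "text": "Doccano is an open source text annotation tool for humans.",
    "questions": [
        {
            "id": 1,
            "question": "What kind of tool is doccano?",
            # The answer is a span over `text`, identified by character
            # offsets and explicitly linked back to this question.
            "answers": [{"start_offset": 11, "end_offset": 46}],
        },
    ],
}

# The answer text can always be recovered from the offsets:
for ans in record["questions"][0]["answers"]:
    print(record["text"][ans["start_offset"]:ans["end_offset"]])
    # -> "an open source text annotation tool"
```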
A naive first pass could be to combine the SequenceLabeling and MachineTranslation UIs. The label for each answer could simply be answer-to-Q1, answer-to-Q2, and so on, with each document capped at a maximum number of questions (a sketch of how those labels could be tied back to the questions follows below).
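To illustrate how that workaround could be stitched back together after export (the structures below are assumed, not doccano's real export format), the answer-to-Qn labels from the sequence-labeling side could be matched to the question entered in the n-th translation-style form:

```python
import re

# Assumed post-export structures for one document: `spans` from a
# SequenceLabeling-style project and `questions` from the Translation-style
# forms. Neither shape is doccano's actual export format.
text = "Paris is the capital of France."
questions = ["What is the capital of France?", "Which country is Paris in?"]
spans = [
    {"start": 0, "end": 5, "label": "answer-to-Q1"},
    {"start": 24, "end": 30, "label": "answer-to-Q2"},
]

def pair_answers(text, questions, spans):
    """Tie each answer-to-Qn span back to the n-th question."""
    pairs = []
    for span in spans:
        match = re.fullmatch(r"answer-to-Q(\d+)", span["label"])
        if not match:
            continue  # skip labels outside this naming convention
        q_index = int(match.group(1)) - 1
        if q_index < len(questions):
            pairs.append({
                "question": questions[q_index],
                "answer": text[span["start"]:span["end"]],
            })
    return pairs

print(pair_answers(text, questions, spans))
# [{'question': 'What is the capital of France?', 'answer': 'Paris'},
#  {'question': 'Which country is Paris in?', 'answer': 'France'}]
```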
Does this sound doable?
Issue Analytics
- Created: 5 years ago
- Reactions: 29
- Comments: 13 (7 by maintainers)
Top GitHub Comments
This feature is really important and urgent. I look forward to contributing to it and will keep following its progress.
Note: the need for a tool like this is greater now that high-quality datasets like SQuAD exist, because it’s more practical to train your own models. Given that you can fine-tune BERT on SQuAD, it would be invaluable to have a tool that lets you build your own domain-specific, SQuAD-formatted dataset!
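To make that concrete, here is a minimal sketch of writing passage/question/answer-span annotations out as SQuAD v1.1-style JSON. The input structure is an assumed in-house format, not any actual doccano export:

```python
import json
import uuid

def to_squad(passages):
    """Convert simple (context, question, answer-span) annotations into
    SQuAD v1.1-style JSON.

    `passages` is an assumed structure:
    [{"context": str, "qas": [{"question": str, "start": int, "end": int}]}]
    """
    data = {"version": "1.1", "data": [{"title": "custom", "paragraphs": []}]}
    for passage in passages:
        qas = []
        for qa in passage["qas"]:
            answer_text = passage["context"][qa["start"]:qa["end"]]
            qas.append({
                "id": uuid.uuid4().hex,
                "question": qa["question"],
                "answers": [{"text": answer_text, "answer_start": qa["start"]}],
            })
        data["data"][0]["paragraphs"].append(
            {"context": passage["context"], "qas": qas}
        )
    return data

passages = [{
    "context": "Paris is the capital of France.",
    "qas": [{"question": "What is the capital of France?", "start": 0, "end": 5}],
}]
print(json.dumps(to_squad(passages), indent=2))
```

A file in this shape can be fed straight into the standard SQuAD fine-tuning scripts for BERT-style models.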