
Minimal answer spans are wrong for some examples.

See original GitHub issue

When I read the data and print the questions and minimal answer spans, I noticed that for some examples the plaintext_start_byte and plaintext_end_byte of the minimal answer span appear shifted right by some number of characters.

Example: Question: ‘Who created the series Clannad?’ Minimal span: ‘rel’ (it should be ‘Key’, but the offsets are shifted and instead yield the ‘rel’ fragment of the word ‘released’)

... Clannad(クラナド,Kuranado) is a Japanese visual novel developed by Key and released on April 28, 2004 ...

This is how I read the data file:

    import json

    with open(path_name) as input_file:
        for line in input_file:
            try:
                # Parse one JSON example per line (JSON Lines format)
                json_example = json.loads(line)
                if json_example['language'] not in allowed_langs:
                    continue

                plain_text = json_example['document_plaintext']
...
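Given the field names quoted above, the span extraction can be sketched as follows. This is a minimal sketch assuming the byte offsets are passed in explicitly (in the actual data they may be nested inside an annotation record); `minimal_answer_span` is a hypothetical helper, not part of the dataset tooling:

```python
def minimal_answer_span(json_example, start_byte, end_byte):
    # The *_byte offsets index into the UTF-8 byte sequence of the
    # document, not into the decoded Python string, so encode before
    # slicing and decode the extracted slice back to text.
    doc_bytes = json_example['document_plaintext'].encode('utf-8')
    return doc_bytes[start_byte:end_byte].decode('utf-8')
```

Slicing the decoded string with the same offsets is what produces the shifted ‘rel’ result described above.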

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7

Top GitHub Comments

3 reactions
dhgarrette commented, Aug 4, 2020

I suspect that you’re slicing on the Python string’s characters instead of its bytes. In Python 3, string indices refer to Unicode characters (unlike Python 2, which used bytes). The example you gave contains non-ASCII characters that take up more than one byte each, so the byte and character indices will not be the same:

>>> plaintext = "Clannad(クラナド,Kuranado) is a Japanese visual novel developed by Key and released on"
>>> plaintext_start_byte = 71
>>> plaintext_end_byte = 74
>>> plaintext[plaintext_start_byte:plaintext_end_byte]  # Incorrect
'rel'
>>> plaintext.encode()[plaintext_start_byte:plaintext_end_byte].decode()  # Correct
'Key'
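If downstream code needs character indices for slicing the already-decoded string, a byte offset can be converted by decoding the UTF-8 prefix up to that offset. This is a sketch based on the answer above; `byte_to_char_offset` is a hypothetical helper, not part of any dataset tooling:

```python
def byte_to_char_offset(text, byte_offset):
    # Decode the UTF-8 prefix ending at byte_offset; the character
    # length of that prefix is the corresponding character index.
    # Assumes byte_offset falls on a character boundary, as dataset
    # span offsets should.
    return len(text.encode('utf-8')[:byte_offset].decode('utf-8'))

plaintext = "Clannad(クラナド,Kuranado) is a Japanese visual novel developed by Key and released on"
start = byte_to_char_offset(plaintext, 71)  # character index of 'K'
end = byte_to_char_offset(plaintext, 74)
answer = plaintext[start:end]
```

Each 3-byte kana character shifts the byte offsets two positions past the character offsets, which is exactly the drift reported in the issue.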
0 reactions
hashedi commented, Mar 25, 2022

I think @tomohideshibata’s code works fine.
