
Minimal answer spans are wrong for some examples.

See original GitHub issue

When I read the data and print the questions and minimal answer spans, I noticed that for some examples the plaintext_start_byte and plaintext_end_byte of the minimal answer span appear shifted right by some number of characters.

Example: Question: ‘Who created the series Clannad?’ Minimal span: ‘rel’ (it should be ‘Key’, but the offsets are shifted and instead yield the ‘rel’ fragment of the word ‘released’)

... Clannad(クラナド,Kuranado) is a Japanese visual novel developed by Key and released on April 28, 2004 ...

This is how I read the data file:

    import json

    with open(path_name) as input_file:
        for line in input_file:
            try:
                # Parse one JSON example per line (JSON Lines format)
                json_example = json.loads(line)
                if json_example['language'] not in allowed_langs:
                    continue

                plain_text = json_example['document_plaintext']
...
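Given the field names quoted above, the span extraction can be sketched as follows. This is a minimal sketch assuming the byte offsets are passed in explicitly (in the actual data they may be nested inside an annotation record); `minimal_answer_span` is a hypothetical helper, not part of the dataset tooling:

```python
def minimal_answer_span(json_example, start_byte, end_byte):
    # The *_byte offsets index into the UTF-8 byte sequence of the
    # document, not into the decoded Python string, so encode before
    # slicing and decode the extracted slice back to text.
    doc_bytes = json_example['document_plaintext'].encode('utf-8')
    return doc_bytes[start_byte:end_byte].decode('utf-8')
```

Slicing the decoded string with the same offsets is what produces the shifted ‘rel’ result described above.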

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7

Top GitHub Comments

3 reactions
dhgarrette commented, Aug 4, 2020

I suspect that you’re slicing on the Python string’s characters instead of its bytes. In Python 3, string indices refer to Unicode characters (unlike Python 2, which used bytes). The example you gave contains non-ASCII characters that take up more than one byte each, so the byte and character indices will not be the same:

>>> plaintext = "Clannad(クラナド,Kuranado) is a Japanese visual novel developed by Key and released on"
>>> plaintext_start_byte = 71
>>> plaintext_end_byte = 74
>>> plaintext[plaintext_start_byte:plaintext_end_byte]  # Incorrect
'rel'
>>> plaintext.encode()[plaintext_start_byte:plaintext_end_byte].decode()  # Correct
'Key'
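If downstream code needs character indices for slicing the already-decoded string, a byte offset can be converted by decoding the UTF-8 prefix up to that offset. This is a sketch based on the answer above; `byte_to_char_offset` is a hypothetical helper, not part of any dataset tooling:

```python
def byte_to_char_offset(text, byte_offset):
    # Decode the UTF-8 prefix ending at byte_offset; the character
    # length of that prefix is the corresponding character index.
    # Assumes byte_offset falls on a character boundary, as dataset
    # span offsets should.
    return len(text.encode('utf-8')[:byte_offset].decode('utf-8'))

plaintext = "Clannad(クラナド,Kuranado) is a Japanese visual novel developed by Key and released on"
start = byte_to_char_offset(plaintext, 71)  # character index of 'K'
end = byte_to_char_offset(plaintext, 74)
answer = plaintext[start:end]
```

Each 3-byte kana character shifts the byte offsets two positions past the character offsets, which is exactly the drift reported in the issue.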
0 reactions
hashedi commented, Mar 25, 2022

I think @tomohideshibata’s code works fine.
