Exported JSON contains wrong offsets for HTML Entity Recognition

Describe the bug

When labeling an HTML file using the HTML Entity Recognition template, the offsets produced when exporting the result appear to be wrong.

To Reproduce

  1. Take the following HTML and save it to a .html file
<!DOCTYPE html>
<html lang="en">
  <head>
      <meta charset="UTF-8">
      <title>Title</title>
  </head>
  <body>
  <p> This is a trial document. Let's see if the <b>offsets</b> will be correct.</p><br><p>Here's some tricky names:
      David H&ouml;hler</p><p>Karl K&ouml;stler.</p><p>Some more tricky words<br></p><p>Gesch&auml;ftsf&uuml;hrer, M&uuml;nchen, Datenschutzerkl&auml;rung.</p>
  </body>
</html>
  2. Import it into an HTML Entity Recognition project
  3. Label some entities
  4. Export the data in the “default” JSON format
  5. Run the following Python code using the exported file
from pathlib import Path
import json

# Load the file exported from Label Studio in the "default" JSON format
file = Path("exported_file.json")
label_json = json.loads(file.read_text())

for doc in label_json:
    # Raw HTML of the task, as stored in the export
    text = doc["data"]["html"]

    for annotation in doc["annotations"][0]["result"]:
        value = annotation["value"]
        offsets = value["globalOffsets"]
        # Slice the raw HTML with the exported global offsets
        span = text[offsets["start"] : offsets["end"]]
        print("Real text:", value["text"])
        print("Text according to offsets:", repr(span))
        print("----")

Expected behavior

What I’d expect to see:

Real text: David Höhler
Text according to offsets: 'David Höhler'
----
Real text: Karl Köstler
Text according to offsets: 'Karl Köstler'
[...]

What I actually get:

Real text: David Höhler
Text according to offsets: 's is a trial'
----
Real text: Karl Köstler
Text according to offsets: ' document. L'
[...]
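
One possible explanation, which is an assumption and not something confirmed in this issue, is that the exported offsets count characters in the rendered text (tags stripped, entities such as `&ouml;` decoded) rather than characters in the raw HTML source. The sketch below only illustrates how far those two coordinate systems can drift apart. `TextCollector` and `text_content` are hypothetical helper names, and the standard-library parser is not guaranteed to treat whitespace, `<script>` or `<style>` content the same way a browser does.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes; character references are decoded by the parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_content(html: str) -> str:
    """Return the concatenated text nodes of an HTML string (hypothetical helper)."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts)


# Small sample built from the names used in the reproduction HTML
sample = "<p>Karl K&ouml;stler.</p><p>M&uuml;nchen</p>"
rendered = text_content(sample)

# Offsets of "München" measured in the rendered text
start = rendered.find("München")
end = start + len("München")

print("Slice of rendered text:", repr(rendered[start:end]))
print("Same offsets applied to raw HTML:", repr(sample[start:end]))
```

If the assumption holds, slicing `doc["data"]["html"]` directly with `globalOffsets` would be expected to return unrelated text, as in the output above, while slicing the extracted text might land closer to the labeled span.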

Environment (please complete the following information):

  • Ubuntu 18.04 LTS
  • Label Studio v1.5.0 (Docker image)

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 8
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
HodeiG commented, Oct 3, 2022

Hi,

I think there is a bug with the offsets, regardless of the method to import the HTML files.

I have created a new project with the following Labeling Interface:

<View>
  <Labels name="ner" toName="text">
    <Label value="Person"></Label>
    <Label value="Organization"></Label>
  </Labels>
  <HyperText name="text" value="$text" valueType="url"></HyperText>
</View>

In the Data Import step I added the following URL: https://en.wikipedia.org/wiki/Ottery_St_Mary

Then I annotated the file, selecting the following text: "The town as it now stands has several independent shops"

When I export the file, I get this JSON:

[{
        "id": 1,
        "annotations": [{
                "id": 2,
                "completed_by": 1,
                "result": [{
                        "value": {
                            "start": "\/div[3]\/div[3]\/div[5]\/div[1]\/p[4]\/text()[1]",
                            "end": "\/div[3]\/div[3]\/div[5]\/div[1]\/p[4]\/text()[1]",
                            "startOffset": 0,
                            "endOffset": 55,
                            "globalOffsets": {
                                "start": 4486,
                                "end": 4541
                            },
                            "labels": ["Person"]
                        },
                        "id": "vaAXPxkauU",
                        "from_name": "ner",
                        "to_name": "text",
                        "type": "labels",
                        "origin": "manual"
                    }
                ],
                "was_cancelled": false,
                "ground_truth": false,
                "created_at": "2022-10-03T12:05:36.723291Z",
                "updated_at": "2022-10-03T12:05:36.723321Z",
                "lead_time": 21.639,
                "prediction": {},
                "result_count": 0,
                "task": 1,
                "parent_prediction": null,
                "parent_annotation": null
            }
        ],
        "file_upload": "a81371ea-Ottery_St_Mary",
        "drafts": [],
        "predictions": [],
        "data": {
            "text": "\/data\/upload\/1\/a81371ea-Ottery_St_Mary"
        },
        "meta": {},
        "created_at": "2022-10-03T11:23:08.997546Z",
        "updated_at": "2022-10-03T12:05:36.794640Z",
        "inner_id": 1,
        "total_annotations": 1,
        "cancelled_annotations": 0,
        "total_predictions": 0,
        "comment_count": 0,
        "unresolved_comment_count": 0,
        "last_comment_updated_at": null,
        "project": 1,
        "updated_by": 1,
        "comment_authors": []
    }
]

Now, if I download the original HTML file from https://en.wikipedia.org/wiki/Ottery_St_Mary, open it in Notepad++, press Ctrl+G (or Search > Goto), and jump to offset 4486, I don’t land on the text I annotated. According to Notepad++, the correct offset for that annotation is 94071, not 4486.
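
As a quick sanity check on the export above, the node-relative offsets (`startOffset`/`endOffset`) and the `globalOffsets` can be compared by span length. The sketch below uses a trimmed copy of the exported result and a hypothetical helper, `span_lengths`. It only shows that the two offset pairs describe a span of the same length, which suggests (without proving) that the mismatch lies in where counting starts, not in how long the span is.

```python
import json

# Trimmed copy of the exported result shown above (offset fields only).
exported = json.loads(r"""
{
  "value": {
    "start": "\/div[3]\/div[3]\/div[5]\/div[1]\/p[4]\/text()[1]",
    "end": "\/div[3]\/div[3]\/div[5]\/div[1]\/p[4]\/text()[1]",
    "startOffset": 0,
    "endOffset": 55,
    "globalOffsets": {"start": 4486, "end": 4541},
    "labels": ["Person"]
  }
}
""")


def span_lengths(value: dict) -> tuple:
    """Return (length from node-relative offsets, length from globalOffsets).

    The node-relative length is only meaningful when the start and end
    XPaths point at the same text node, as they do in this export.
    """
    if value["start"] != value["end"]:
        raise ValueError("start and end XPaths differ; lengths are not comparable")
    local = value["endOffset"] - value["startOffset"]
    global_ = value["globalOffsets"]["end"] - value["globalOffsets"]["start"]
    return local, global_


annotated = "The town as it now stands has several independent shops"
local_len, global_len = span_lengths(exported["value"])
print(local_len, global_len, len(annotated))
```

All three numbers agree with the length of the annotated sentence, so the exported span length looks right even though offset 4486 does not point at that sentence in the raw file.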

1 reaction
wpnbos commented, Aug 26, 2022

Yeah, I import them directly through the import window. I’ll try out the branch.
