Exported JSON contains wrong offsets for HTML Entity Recognition

Describe the bug

When labeling an HTML file using the HTML Entity Recognition template, the offsets produced when exporting the result appear to be wrong.

To Reproduce

Take the following HTML and save it to a .html file

<!DOCTYPE html>
<html lang="en">
  <head>
      <meta charset="UTF-8">
      <title>Title</title>
  </head>
  <body>
  <p> This is a trial document. Let's see if the <b>offsets</b> will be correct.</p><br><p>Here's some tricky names:
      David H&ouml;hler</p><p>Karl K&ouml;stler.</p><p>Some more tricky words<br></p><p>Gesch&auml;ftsf&uuml;hrer, M&uuml;nchen, Datenschutzerkl&auml;rung.</p>
  </body>
</html>

Import it into an HTML Entity Recognition project
Label some entities

Export the data in the “default” JSON format
Run the following Python code using the exported file

import json
from pathlib import Path

# Load the Label Studio export ("default" JSON format).
file = Path("exported_file.json")
label_json = json.loads(file.read_text(encoding="utf-8"))

for doc in label_json:
    # Raw HTML source of the task, as stored in the export.
    text = doc["data"]["html"]

    for annotation in doc["annotations"][0]["result"]:
        value = annotation["value"]
        offsets = value["globalOffsets"]
        print("Real text:", value["text"])
        # Slice the raw HTML with the exported global offsets.
        print(
            "Text according to offsets:",
            repr(text[offsets["start"] : offsets["end"]]),
        )
        print("----")

Expected behavior

What I’d expect to see:

Real text: David Höhler
Text according to offsets: 'David Höhler'
----
Real text: Karl Köstler
Text according to offsets: 'Karl Köstler'
[...]

What I actually get:

Real text: David Höhler
Text according to offsets: 's is a trial'
----
Real text: Karl Köstler
Text according to offsets: ' document. L'
[...]
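
One possible explanation, which this issue does not confirm, is that the exported offsets count characters in the document's text content (tags removed, entities such as &ouml; decoded) rather than in the raw HTML source. The following is a minimal sketch for testing that assumption with Python's standard html.parser module. TextCollector and text_content are hypothetical helper names, and a browser's text representation may treat whitespace differently.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects only text nodes; character references are decoded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_content(html_source):
    """Return the concatenated text nodes of html_source (hypothetical helper)."""
    parser = TextCollector()
    parser.feed(html_source)
    parser.close()
    return "".join(parser.chunks)


# Small fragment taken from the reproduction HTML above.
sample = "<p>David H&ouml;hler</p><p>Karl K&ouml;stler.</p>"
plain = text_content(sample)

# The same name sits at different positions in the two views.
print("offset in raw HTML:    ", sample.find("Karl K"))
print("offset in text content:", plain.find("Karl Köstler"))
```

If slicing text_content(doc["data"]["html"]) with the exported globalOffsets returns the labeled text, the offsets are relative to the text-only view. If it does not, this assumption is wrong and the cause lies elsewhere.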

Environment (please complete the following information):

Ubuntu 18.04 LTS
Label Studio v1.5.0 (Docker image)

Top GitHub Comments

HodeiG commented, Oct 3, 2022

Hi,

I think there is a bug with the offsets, regardless of the method used to import the HTML files.

I have created a new project with the following Labeling Interface:

<View>
  <Labels name="ner" toName="text">
    <Label value="Person"></Label>
    <Label value="Organization"></Label>
  </Labels>
  <HyperText name="text" value="$text" valueType="url"></HyperText>
</View>

In the Data Import step I added the following URL: https://en.wikipedia.org/wiki/Ottery_St_Mary

Then I annotated the following text in that file: "The town as it now stands has several independent shops"

When I export the file, I get this JSON:

[{
        "id": 1,
        "annotations": [{
                "id": 2,
                "completed_by": 1,
                "result": [{
                        "value": {
                            "start": "\/div[3]\/div[3]\/div[5]\/div[1]\/p[4]\/text()[1]",
                            "end": "\/div[3]\/div[3]\/div[5]\/div[1]\/p[4]\/text()[1]",
                            "startOffset": 0,
                            "endOffset": 55,
                            "globalOffsets": {
                                "start": 4486,
                                "end": 4541
                            },
                            "labels": ["Person"]
                        },
                        "id": "vaAXPxkauU",
                        "from_name": "ner",
                        "to_name": "text",
                        "type": "labels",
                        "origin": "manual"
                    }
                ],
                "was_cancelled": false,
                "ground_truth": false,
                "created_at": "2022-10-03T12:05:36.723291Z",
                "updated_at": "2022-10-03T12:05:36.723321Z",
                "lead_time": 21.639,
                "prediction": {},
                "result_count": 0,
                "task": 1,
                "parent_prediction": null,
                "parent_annotation": null
            }
        ],
        "file_upload": "a81371ea-Ottery_St_Mary",
        "drafts": [],
        "predictions": [],
        "data": {
            "text": "\/data\/upload\/1\/a81371ea-Ottery_St_Mary"
        },
        "meta": {},
        "created_at": "2022-10-03T11:23:08.997546Z",
        "updated_at": "2022-10-03T12:05:36.794640Z",
        "inner_id": 1,
        "total_annotations": 1,
        "cancelled_annotations": 0,
        "total_predictions": 0,
        "comment_count": 0,
        "unresolved_comment_count": 0,
        "last_comment_updated_at": null,
        "project": 1,
        "updated_by": 1,
        "comment_authors": []
    }
]
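
To compare the different offset fields side by side, a short script can pull them out of the export. This is only a sketch: it embeds a trimmed copy of the JSON above (fields not needed here are omitted), and summarize_offsets is a hypothetical helper name, not part of Label Studio.

```python
import json

# Trimmed copy of the export shown above; only the offset-related fields are kept.
EXPORT = r"""
[{"id": 1,
  "annotations": [{"id": 2, "result": [{
      "value": {
          "start": "\/div[3]\/div[3]\/div[5]\/div[1]\/p[4]\/text()[1]",
          "end": "\/div[3]\/div[3]\/div[5]\/div[1]\/p[4]\/text()[1]",
          "startOffset": 0,
          "endOffset": 55,
          "globalOffsets": {"start": 4486, "end": 4541},
          "labels": ["Person"]},
      "id": "vaAXPxkauU"}]}]}]
"""


def summarize_offsets(tasks):
    """Return one dict per labeled region with its XPath and offset fields."""
    rows = []
    for task in tasks:
        for annotation in task["annotations"]:
            for region in annotation["result"]:
                value = region["value"]
                global_offsets = value["globalOffsets"]
                rows.append({
                    "task": task["id"],
                    "labels": value["labels"],
                    "start_xpath": value["start"],
                    "local_length": value["endOffset"] - value["startOffset"],
                    "global_start": global_offsets["start"],
                    "global_length": global_offsets["end"] - global_offsets["start"],
                })
    return rows


rows = summarize_offsets(json.loads(EXPORT))
for row in rows:
    print(row)
```

Both the local span (startOffset to endOffset within the XPath text node) and the global span are 55 characters long, which is the length of the annotated sentence. So the span length is consistent, and only the global start position is in question.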

Now, if I download the original HTML file from https://en.wikipedia.org/wiki/Ottery_St_Mary, open it in Notepad++, press Ctrl+G (or Search > Goto), and set the offset to 4486, it doesn’t take me to the text that I annotated. According to Notepad++, the correct offset of that annotation is 94071, not 4486.
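
The same check can be scripted instead of done by hand in an editor. The sketch below uses a tiny inline document in place of the downloaded page; find_all and check_offsets are hypothetical helper names. Note that Python string indices count characters, which may differ from an editor's byte-based offsets when the file contains non-ASCII characters.

```python
def find_all(haystack, needle):
    """Return every character offset at which needle occurs in haystack."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions


def check_offsets(source, expected_text, start, end):
    """Compare the slice at [start:end] with the text that was annotated."""
    actual = source[start:end]
    return {
        "slice": actual,
        "matches": actual == expected_text,
        "found_at": find_all(source, expected_text),
    }


# Stand-in for the downloaded page; a real run would read the saved HTML file.
source = (
    "<html><body><p>The town as it now stands has several "
    "independent shops</p></body></html>"
)
annotated = "The town as it now stands has several independent shops"

# Offsets 0..55 stand in for the exported globalOffsets in this toy document.
report = check_offsets(source, annotated, 0, 55)
print(report["matches"], report["found_at"])
```

For the real page, source would be the contents of the downloaded file, and start and end would be the exported globalOffsets values (4486 and 4541).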

wpnbos commented, Aug 26, 2022

Yeah, I import them directly through the import window. I’ll try out the branch.
