Exported JSON contains wrong offsets for HTML Entity Recognition
Describe the bug
When labeling an HTML file using the HTML Entity Recognition template, the offsets produced when exporting the result appear to be wrong.
To Reproduce
- Take the following HTML and save it to a .html file
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<p> This is a trial document. Let's see if the <b>offsets</b> will be correct.</p><br><p>Here's some tricky names:
David Höhler</p><p>Karl Köstler.</p><p>Some more tricky words<br></p><p>Geschäftsführer, München, Datenschutzerklärung.</p>
</body>
</html>
- Import it into an HTML Entity Recognition project
- Label some entities

- Export the data in the “default” JSON format
- Run the following Python code using the exported file
from pathlib import Path
import json

file = Path("exported_file.json")
label_json = json.loads(file.read_text())
for doc in label_json:
    text = doc["data"]["html"]
    for annotation in doc["annotations"][0]["result"]:
        print("Real text:", annotation["value"]["text"])
        print(
            "Text according to offsets:",
            repr(
                text[
                    annotation["value"]["globalOffsets"]["start"] : annotation["value"][
                        "globalOffsets"
                    ]["end"]
                ]
            ),
        )
        print("----")
Expected behavior
What I’d expect to see:
Real text: David Höhler
Text according to offsets: 'David Höhler'
----
Real text: Karl Köstler
Text according to offsets: 'Karl Köstler'
[...]
What I actually get:
Real text: David Höhler
Text according to offsets: 's is a trial'
----
Real text: Karl Köstler
Text according to offsets: ' document. L'
[...]
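To narrow down a mismatch like the one above, it can help to record where each labeled string actually occurs in the raw HTML next to the offset the export reports. The following is a minimal sketch, not part of the original report: it assumes the same export keys used in the script above (`data.html`, `annotations[0].result`, `value.text`, `value.globalOffsets`), and `check_offsets` and `sample_export` are hypothetical names with made-up offsets for illustration.

```python
def check_offsets(export):
    """Compare each label's text with the slice its globalOffsets select."""
    report = []
    for doc in export:
        html = doc["data"]["html"]
        for annotation in doc["annotations"][0]["result"]:
            value = annotation["value"]
            start = value["globalOffsets"]["start"]
            end = value["globalOffsets"]["end"]
            label_text = value["text"]

            # Collect every position where the labeled text occurs in the raw HTML.
            found_at = []
            pos = html.find(label_text)
            while pos != -1:
                found_at.append(pos)
                pos = html.find(label_text, pos + 1)

            report.append(
                {
                    "text": label_text,
                    "sliced": html[start:end],
                    "matches": html[start:end] == label_text,
                    "exported_start": start,
                    "found_at": found_at,
                }
            )
    return report


# Hypothetical in-memory export with made-up offsets (not real Label Studio output).
sample_export = [
    {
        "data": {"html": "<p>Karl Köstler.</p>"},
        "annotations": [
            {
                "result": [
                    {
                        "value": {
                            "text": "Karl Köstler",
                            "globalOffsets": {"start": 0, "end": 12},
                        }
                    }
                ]
            }
        ],
    }
]

for row in check_offsets(sample_export):
    print(row)
```

If `exported_start` is consistently smaller than the positions in `found_at`, that would suggest the exported offsets are not counted against the raw HTML string, although this sketch alone cannot tell what they are counted against.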
Environment (please complete the following information):
- Ubuntu 18.04 LTS
- Label Studio v1.5.0 (Docker image)
Hi,
I think there is a bug with the offsets, regardless of the method to import the HTML files.
I have created a new project with the following Labeling Interface:
In the Data Import step I added the following URL: https://en.wikipedia.org/wiki/Ottery_St_Mary
Then I try to annotate that file, specifically this text as shown below:
The town as it now stands has several independent shops
When I export the file, I get this JSON:
Now, if I download the original HTML file https://en.wikipedia.org/wiki/Ottery_St_Mary, open it with Notepad++, press Ctrl+G (or Search > Goto) and set the offset to 4486, it doesn’t take me to the text that I annotated. According to Notepad++, the correct offset of that annotation is 94071, not 4486.
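When comparing an exported offset with a position shown by an editor, it may be worth checking that both count the same unit. The sketch below assumes the editor counts UTF-8 bytes while Python string indices count characters. That is an assumption about the editor, not something confirmed in this thread, and it would not by itself explain a gap as large as 4486 versus 94071. `char_to_byte_offset` is a hypothetical helper name.

```python
def char_to_byte_offset(text, char_offset, encoding="utf-8"):
    """Return the byte offset that corresponds to a character offset in text."""
    return len(text[:char_offset].encode(encoding))


# Words taken from the sample HTML in the report; "ä" and "ü" take two bytes each in UTF-8.
sample = "Geschäftsführer, München"
char_pos = sample.index("München")
byte_pos = char_to_byte_offset(sample, char_pos)

print("character offset:", char_pos)
print("byte offset:", byte_pos)
```

Every non-ASCII character before a position shifts the byte offset further from the character offset, so the two only agree for pure ASCII text.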
Yeah I import them directly to the import window. I’ll try out the branch