Erroneous parsing of XML content

Hello,

I’ve noticed that the parser malfunctions when processing certain XML content. Here is the snippet of the XML document that I am processing:

والجدير بالذكر أن الدورات النقابية المتعاقبة منذ عام 1950م وحتى 2011م بلغت خمسة عشر دورة تتفاوت في آجالها من عام لعامين ولأربعة أعوام ثم خمسة أعوام اعتباراً من دورة 1996_2001 م حسب نصوص القوانين وتطوراته </doc> <doc id="1432834" url="https://ar.wikipedia.org/wiki?curid=1432834" title="ريتا حايك">

However, after parsing the file (with either doc = Jsoup.parse(File f) or doc = Jsoup.parseBodyFragment(String s)) and printing the document’s content with doc.html(), I notice that one closing doc tag has turned into a comment close:

والجدير بالذكر أن الدورات النقابية المتعاقبة منذ عام 1950م وحتى 2011م بلغت خمسة عشر دورة تتفاوت في آجالها من عام لعامين ولأربعة أعوام ثم خمسة أعوام اعتباراً من دورة 1996_2001 م حسب نصوص القوانين وتطوراته </doc--> <doc id="1432834" url="https://ar.wikipedia.org/wiki?curid=1432834" title="ريتا حايك">

As a result, all the remaining content (over 50MB) is loaded as a single doc. Does anything come to mind? Am I doing something wrong, or should this be considered a bug?
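
While the cause is unclear, one possible workaround is to skip the HTML parser for the block boundaries altogether. The sketch below assumes the input is a flat sequence of non-nested <doc id="..." url="..." title="..."> ... </doc> blocks, as in the snippet above. DocSplitter and its split method are hypothetical names made up for this illustration and are not part of jsoup; the code uses only the Java standard library and does not fix or explain the parser behavior itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical workaround, not jsoup API: splits text into <doc> blocks
// with a regular expression instead of an HTML parser.
class DocSplitter {

    // Assumption: blocks are flat and never nested.
    private static final Pattern DOC =
            Pattern.compile("<doc\\s+([^>]*)>(.*?)</doc>", Pattern.DOTALL);
    private static final Pattern ID = Pattern.compile("\\bid=\"([^\"]*)\"");

    // Maps each doc id to its trimmed body text, in document order.
    static Map<String, String> split(String text) {
        Map<String, String> docs = new LinkedHashMap<>();
        Matcher block = DOC.matcher(text);
        while (block.find()) {
            Matcher id = ID.matcher(block.group(1));
            // Fall back to a synthetic key if a block has no id attribute.
            String key = id.find() ? id.group(1) : "unknown-" + docs.size();
            docs.put(key, block.group(2).trim());
        }
        return docs;
    }

    public static void main(String[] args) {
        // Made-up sample input for illustration only.
        String sample = "<doc id=\"1\" url=\"u1\" title=\"a\">\nfirst\n</doc>\n"
                + "<doc id=\"2\" url=\"u2\" title=\"b\">\nsecond\n</doc>\n";
        split(sample).forEach((id, body) -> System.out.println(id + " -> " + body));
    }
}
```

Each body can then be handed to whatever processing step follows. Note that this reads the whole input into memory as one string, which may or may not be acceptable for a file of this size.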

Issue Analytics

Created 6 years ago
Comments: 6 (5 by maintainers)

Top GitHub Comments

wojtuch commented, Oct 23, 2017

Hi!

The following file is produced by https://github.com/attardi/wikiextractor applied on the arabic wikipedia dump: jsoup-problem.txt

$ file jsoup-problem.txt
jsoup-problem.txt: UTF-8 Unicode text, with very long lines

I guess there is no BOM, and I couldn’t find any other unusual tags either.
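
A guess about the BOM can be checked directly. The following is a minimal sketch, using only the Java standard library, that lists a BOM and any other invisible formatting characters (Unicode category Cf, which includes directional marks such as U+200F) in a string. InvisibleCharScan and scan are hypothetical names for this illustration, not jsoup API, and whether such characters are actually involved in this issue is an assumption, not something the thread confirms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of jsoup: reports a BOM and
// other invisible formatting characters found in a piece of text.
class InvisibleCharScan {

    // Returns one description per suspicious code point, in text order.
    static List<String> scan(String text) {
        List<String> findings = new ArrayList<>();
        int offset = 0;
        while (offset < text.length()) {
            int cp = text.codePointAt(offset);
            if (cp == 0xFEFF) {
                findings.add(String.format("U+FEFF (byte order mark) at offset %d", offset));
            } else if (Character.getType(cp) == Character.FORMAT) {
                // Category Cf: directional marks, zero width joiners, etc.
                findings.add(String.format("U+%04X (%s) at offset %d",
                        cp, Character.getName(cp), offset));
            }
            offset += Character.charCount(cp);
        }
        return findings;
    }

    public static void main(String[] args) {
        // Made-up sample: a BOM, then a RIGHT-TO-LEFT MARK before a closing tag.
        String sample = "\uFEFFtext \u200F</doc> <doc id=\"1\">";
        scan(sample).forEach(System.out::println);
    }
}
```

To apply it to the real file, the contents could be read with Files.readString(path, StandardCharsets.UTF_8) and passed to scan. Offsets are counted in UTF-16 code units, matching Java string indices.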

jhy commented, Jul 10, 2021

(Closing this out, as we’ve had no other related reports, which I would expect if there were an endemic issue. I expect this was root-caused by the document flipping around the text order.)
