Erroneous parsing of XML content

Hello,

I’ve noticed that the parser malfunctions when processing certain XML content. This is the snippet of the XML document that I am processing:

والجدير بالذكر أن الدورات النقابية المتعاقبة منذ عام 1950م وحتى 2011م بلغت خمسة عشر دورة تتفاوت في آجالها من عام لعامين ولأربعة أعوام ثم خمسة أعوام اعتباراً من دورة 1996_2001 م حسب نصوص القوانين وتطوراته </doc> <doc id="1432834" url="https://ar.wikipedia.org/wiki?curid=1432834" title="ريتا حايك">

However, after parsing the file (with either doc = Jsoup.parse(File f) or doc = Jsoup.parseBodyFragment(String s)) and printing the document’s content (doc.html()), I notice that one closing doc tag has turned into what looks like a comment close (</doc-->):

والجدير بالذكر أن الدورات النقابية المتعاقبة منذ عام 1950م وحتى 2011م بلغت خمسة عشر دورة تتفاوت في آجالها من عام لعامين ولأربعة أعوام ثم خمسة أعوام اعتباراً من دورة 1996_2001 م حسب نصوص القوانين وتطوراته </doc--> <doc id="1432834" url="https://ar.wikipedia.org/wiki?curid=1432834" title="ريتا حايك">

Because of that, all the remaining content (over 50 MB) is loaded as a single doc element. Does anything come to mind? Am I doing something wrong, or should this be considered a bug?
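Both calls mentioned above use jsoup's HTML parser. One thing that may be worth trying, since the input is XML-like, is jsoup's XML parser (Parser.xmlParser()). The sketch below is only an illustration under assumptions: the class name, the helper parseAsXml, and the two-record sample are hypothetical, and nothing in this thread confirms that switching parsers resolves the reported behaviour.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class XmlParseSketch {

    /** Parses the input with jsoup's XML parser instead of the default HTML parser. */
    static Document parseAsXml(String xml) {
        // The empty string is the base URI; it is only used to resolve relative links.
        return Jsoup.parse(xml, "", Parser.xmlParser());
    }

    public static void main(String[] args) {
        // Hypothetical sample shaped like the wikiextractor output (not the real data).
        String sample =
            "<doc id=\"1\" url=\"https://example.org/1\" title=\"first\">one</doc> "
          + "<doc id=\"2\" url=\"https://example.org/2\" title=\"second\">two</doc>";

        Document doc = parseAsXml(sample);
        // Each record should come back as its own <doc> element.
        for (Element record : doc.select("doc")) {
            System.out.println(record.attr("id") + " -> " + record.text());
        }
    }
}
```

If the number of doc elements found this way matches the number of records in the file, the merging seen with the HTML parser would at least be narrowed down to parser mode rather than the data itself.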

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
wojtuch commented, Oct 23, 2017

Hi!

The following file was produced by running https://github.com/attardi/wikiextractor on the Arabic Wikipedia dump: jsoup-problem.txt

$ file jsoup-problem.txt
jsoup-problem.txt: UTF-8 Unicode text, with very long lines

I guess there is no BOM, and I couldn’t find any other unusual tags either.

0 reactions
jhy commented, Jul 10, 2021

(Closing out as we’ve had no other related reports, which I would expect if there were an endemic issue. I expect this was root-caused by a document flipping around the text order.)
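The BOM guess and the text-order hypothesis in the comments above could be checked by scanning the input for a byte order mark and for Unicode directional formatting characters, which are invisible in most editors. The following is a minimal sketch under assumptions: the class and method names are hypothetical, and the set of code points checked is one reasonable choice, not something taken from the thread.

```java
import java.util.ArrayList;
import java.util.List;

public class InvisibleCharScan {

    /** True for the BOM and for Unicode directional formatting characters. */
    static boolean isSuspicious(int cp) {
        return cp == 0xFEFF                      // byte order mark
            || cp == 0x200E || cp == 0x200F      // left-to-right / right-to-left mark
            || cp == 0x061C                      // Arabic letter mark
            || (cp >= 0x202A && cp <= 0x202E)    // embeddings and overrides
            || (cp >= 0x2066 && cp <= 0x2069);   // isolates
    }

    /** Returns "index:U+XXXX" for every suspicious code point in the text. */
    static List<String> scan(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isSuspicious(cp)) {
                hits.add(i + ":" + String.format("U+%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a BOM at the start and a right-to-left mark before the tag.
        String sample = "\uFEFFtext \u200F</doc> <doc id=\"1\">";
        System.out.println(scan(sample)); // prints [0:U+FEFF, 6:U+200F]
    }
}
```

An empty result for the problematic file would suggest that the flipped appearance comes from how a viewer renders mixed right-to-left and left-to-right text rather than from control characters stored in the data.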
