CDATA fields are lost after calling Jsoup.parse
First, congratulations on your great library - it’s awesome! However, we’re having an issue that got us into serious trouble. I’ll explain our scenario:
We’re running a search/replace mechanism across many pages of a CMS. The page content is XHTML. For each page, we basically do the following:
String xhtml = page.getBody();
Document document = Jsoup.parse(xhtml, "", Parser.xmlParser());
// remove some content from document ...
page.setBody(document.text());
The big problem is that a lot of the content looks like this:
<some-node><![CDATA[some.string.content=content]]></some-node>
As soon as we call Jsoup.parse, the CDATA section is gone, and text() produces this:
<some-node>some.string.content=content></some-node>
What we have afterwards is a page with corrupt content.
We’d be very grateful for some help, since we really enjoy using Jsoup otherwise!
Issue Analytics
- Created: 9 years ago
- Reactions: 3
- Comments: 16 (5 by maintainers)

I had the exact same challenge with Confluence as @ataraxie. If it helps anyone else, I ended up writing a helper function that restores the CDATA section. Perhaps not the prettiest solution, but it fulfilled my needs.
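The helper itself is not shown in this thread. One possible way to get a similar result is to swap every CDATA section for a plain-text placeholder before handing the markup to jsoup, and to put the original sections back into the serialized output afterwards. The following is only a minimal sketch of that idea using the Java standard library; the class name `CdataGuard`, its `mask`/`unmask` methods, and the `__CDATA_n__` token format are all hypothetical and are not part of jsoup or of the helper mentioned above. It also assumes the token text never occurs in the page content itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of jsoup): hides CDATA sections from a parser and restores them later. */
class CdataGuard {
    // Matches a whole CDATA section, including content that spans several lines.
    private static final Pattern CDATA = Pattern.compile("<!\\[CDATA\\[.*?\\]\\]>", Pattern.DOTALL);
    // Matches the placeholder tokens written by mask(). The token format is an assumption.
    private static final Pattern TOKEN = Pattern.compile("__CDATA_(\\d+)__");

    // Original CDATA sections, indexed by the number embedded in their token.
    private final List<String> saved = new ArrayList<>();

    /** Replaces each CDATA section with a token such as __CDATA_0__ and remembers the original. */
    public String mask(String xml) {
        Matcher m = CDATA.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            saved.add(m.group());
            m.appendReplacement(out, "__CDATA_" + (saved.size() - 1) + "__");
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Puts the remembered CDATA sections back in place of their tokens. */
    public String unmask(String xml) {
        Matcher m = TOKEN.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int index = Integer.parseInt(m.group(1));
            // Leave unknown tokens untouched instead of failing.
            String replacement = index < saved.size() ? saved.get(index) : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With the scheme from the original report, this would presumably mean calling `guard.mask(page.getBody())` before `Jsoup.parse(...)` and `guard.unmask(document.outerHtml())` before `page.setBody(...)`. The obvious trade-off is that the CDATA content is invisible to jsoup while masked, so it cannot be searched or modified during that step.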
My use case: I have a page that contains CDATA. I used jsoup to parse it and update some other tags, then passed along the updated page (toString()). But all the CDATA sections are gone… which is not what I wanted.
Because of this, I simply have to give up using jsoup.