CDATA fields are lost after calling Jsoup.parse

First, congratulations on your great library. It’s awesome! However, we’re having an issue that has gotten us into serious trouble. I’ll explain our scenario:

We’re running a search/replace mechanism on many pages of a CMS. The content of the pages is XHTML. The basic scheme for each page is:

String xhtml = page.getBody();
Document document = Jsoup.parse(xhtml, "", Parser.xmlParser());
// remove some content from document ...
page.setBody(document.text());

The big problem is that there is a lot of content that looks like

<some-node><![CDATA[some.string.content=content]]></some-node>

As soon as we call Jsoup.parse, the CDATA section is gone, and text() produces this:

<some-node>some.string.content=content></some-node>

What we have afterwards is a page with corrupt content.

We’d be very grateful for some help, since we really enjoy using Jsoup otherwise!

Issue Analytics

  • State:closed
  • Created 9 years ago
  • Reactions:3
  • Comments:16 (5 by maintainers)

Top GitHub Comments

2 reactions
jonada commented, Oct 23, 2016

I had the exact same challenge with Confluence as @ataraxie. If it helps anyone else, I ended up writing a helper function that restores the CDATA section. Perhaps not the prettiest solution, but it fulfilled my needs.

import org.springframework.web.util.HtmlUtils;

/**
 * Restores plain text in HTML to its CDATA equivalent.
 * For example, jsoup parses a CDATA section and returns an HTML-escaped string. This function restores it.
 *
 * @param html The HTML code
 * @param tag The enclosing tag of the text that shall be restored.
 * @return The HTML with the text inside each matching tag unescaped and wrapped in a CDATA section
 */
public static String restoreCDATA(String html, String tag) {
    int startIdx;
    int endIdx;
    String startTag = "<" + tag + ">";
    String endTag = "</" + tag + ">";

    // 1. Find next occurrence
    startIdx = html.indexOf(startTag);
    while (startIdx >= 0) {

        // 2. Find end boundary
        endIdx = html.indexOf(endTag, startIdx);
        if (endIdx < 0) break;

        // 3. Replace with "unescaped" text
        startIdx += startTag.length();
        String cdata = "<![CDATA[" + HtmlUtils.htmlUnescape(html.substring(startIdx, endIdx)) + "]]>";
        html = html.substring(0, startIdx) + cdata + html.substring(endIdx);

        // 4. Repeat for all occurrences. The replacement changes the string length,
        //    so resume right after the rewritten element instead of reusing the old endIdx.
        startIdx = html.indexOf(startTag, startIdx + cdata.length() + endTag.length());
    }
    return html;
}
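
For anyone who wants to try the idea without Spring on the classpath, here is a minimal, self-contained sketch of the same approach. It is not part of the original comment: the class name CdataRestorer and the helper unescapeBasic are hypothetical, and unescapeBasic only handles the five predefined XML entities as a stand-in for HtmlUtils.htmlUnescape. Like the helper above, it assumes the tag has no attributes and that the text does not itself contain "]]>".

```java
class CdataRestorer {

    /**
     * Hypothetical stand-in for Spring's HtmlUtils.htmlUnescape.
     * Only the five predefined XML entities are handled.
     */
    public static String unescapeBasic(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
    }

    /** Wraps the unescaped text of every <tag>...</tag> element in a CDATA section. */
    public static String restoreCdata(String html, String tag) {
        String startTag = "<" + tag + ">";
        String endTag = "</" + tag + ">";
        StringBuilder out = new StringBuilder();
        int pos = 0; // index in the ORIGINAL string up to which output has been copied

        while (true) {
            int start = html.indexOf(startTag, pos);
            if (start < 0) break;
            int contentStart = start + startTag.length();
            int end = html.indexOf(endTag, contentStart);
            if (end < 0) break;

            // Copy everything up to and including the start tag, then the wrapped text.
            out.append(html, pos, contentStart)
               .append("<![CDATA[")
               .append(unescapeBasic(html.substring(contentStart, end)))
               .append("]]>")
               .append(endTag);
            pos = end + endTag.length();
        }
        out.append(html, pos, html.length()); // remainder after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String parsed = "<root><some-node>a &lt; b &amp;&amp; c</some-node>"
                + "<other>x</other><some-node>k=v</some-node></root>";
        System.out.println(restoreCdata(parsed, "some-node"));
    }
}
```

Because the output is built in a separate StringBuilder while indices always refer to the original string, the offsets never go stale when the unescaped text is shorter or longer than the escaped text.
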

1 reaction
saiyinleung commented, Oct 23, 2017

My use case: I have a page that contains CDATA. I used jsoup to parse it and update some other tags, then passed along the updated page (toString()). But all the CDATA sections are gone, which is not what I wanted.

Because of this, I simply have to give up using jsoup.
