Jsoup parse fails to descend block elements inside an anchor element correctly

_With the input DOM given below and a simple traversal test, Jsoup yields an unexpected DOM walk_

<p>
  <a>beyond good and evil
    <div></div>
  </a>
  thus spoke zarathustra
</p>
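
// Imports assumed from the identifiers used below (jsoup, SLF4J, JUnit 4).
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Enclosing class name is hypothetical; the original report shows only its members.
public class JsoupTraversalTest {

  // Logs every node on entry (head) and on exit (tail), together with its depth.
  static class LoggingVisitor implements NodeVisitor {
    private final Logger logger = LoggerFactory.getLogger("Tree depth logger");

    @Override
    public void head(Node node, int depth) {
      logger.info("head node " + node.nodeName() + " with depth " + depth);
    }

    @Override
    public void tail(Node node, int depth) {
      logger.info("tail node " + node.nodeName() + " with depth " + depth);
    }
  }

  @Test
  public void testJsoupTraversalWithBlockElementInsideAnchor() {
    // Non-conforming input: a block-level <div> nested inside <a> inside <p>.
    String body =
      "<p>" +
        "<a>beyond good and evil" +
          "<div></div>" +
        "</a>" +
        "thus spoke zarathustra" +
      "</p>";
    Document doc = Jsoup.parseBodyFragment(body);
    LoggingVisitor visitor = new LoggingVisitor();
    // Instance-style traversal, as used in the original report.
    NodeTraversor traversor = new NodeTraversor(visitor);
    traversor.traverse(doc.body());
  }
}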

_Printing the Jsoup-parsed document shows rogue HTML_

<html>
 <head></head>
 <body>
  <p><a>beyond good and evil</a></p>
  <div></div>thus spoke zarathustra
  <p></p>
 </body>
</html>

INFO   [11:01:28.070] [main] Tree depth logger -  head node body with depth 0 
INFO   [11:01:28.073] [main] Tree depth logger -  head node p with depth 1 
INFO   [11:01:28.073] [main] Tree depth logger -  head node a with depth 2 
INFO   [11:01:28.073] [main] Tree depth logger -  head node #text with depth 3 
INFO   [11:01:28.073] [main] Tree depth logger -  tail node #text with depth 3 
INFO   [11:01:28.073] [main] Tree depth logger -  tail node a with depth 2 
INFO   [11:01:28.074] [main] Tree depth logger -  tail node p with depth 1 
INFO   [11:01:28.074] [main] Tree depth logger -  head node div with depth 1 
INFO   [11:01:28.074] [main] Tree depth logger -  tail node div with depth 1 
INFO   [11:01:28.074] [main] Tree depth logger -  head node #text with depth 1 
INFO   [11:01:28.074] [main] Tree depth logger -  tail node #text with depth 1 
INFO   [11:01:28.074] [main] Tree depth logger -  head node p with depth 1 
INFO   [11:01:28.074] [main] Tree depth logger -  tail node p with depth 1 
INFO   [11:01:28.074] [main] Tree depth logger -  tail node body with depth 0 
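
One way to read the log above is to indent each head event by its reported depth. The sketch below does that for the head events copied from the log. It does not use jsoup at all, and the names `TraversalOutline`, `HeadEvent` and `outline` are illustrative only.

```java
import java.util.Arrays;
import java.util.List;

public class TraversalOutline {

    /** One "head node X with depth N" entry from the log (hypothetical helper type). */
    static final class HeadEvent {
        final String nodeName;
        final int depth;

        HeadEvent(String nodeName, int depth) {
            this.nodeName = nodeName;
            this.depth = depth;
        }
    }

    /** Renders the events as an outline, two spaces of indentation per depth level. */
    static String outline(List<HeadEvent> events) {
        StringBuilder sb = new StringBuilder();
        for (HeadEvent event : events) {
            for (int i = 0; i < event.depth; i++) {
                sb.append("  ");
            }
            sb.append(event.nodeName).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Head events in the order they appear in the log above.
        List<HeadEvent> observed = Arrays.asList(
            new HeadEvent("body", 0),
            new HeadEvent("p", 1),
            new HeadEvent("a", 2),
            new HeadEvent("#text", 3),
            new HeadEvent("div", 1),
            new HeadEvent("#text", 1),
            new HeadEvent("p", 1));
        System.out.print(outline(observed));
    }
}
```

In the resulting outline the div, the second #text node and the trailing empty p all sit at depth 1, as siblings of the first p rather than as descendants of the anchor.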

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

cketti commented, Mar 16, 2017

@worldsense-tms: This is non-conforming HTML which will always parse in unexpected (well, unless you read the specification) ways. However, ending up with nested anchor elements in the DOM tree is a bug. I opened issue #845 for that.

worldsense-tms commented, Mar 16, 2017

I did some debugging. At the time we encounter the child anchor, the stack inside HtmlTreeBuilder looks like this:

html, body, div, a, h2

Then in HtmlTreeBuilderState line 284 we remove the anchor from the stack:

html, body, div, h2

But the anchor is still a child of the div. Here’s a screenshot of the debug panel in IntelliJ:

[Screenshot: jsoup-nested-anchor]

I would bet that the element gets “reinserted” because it’s still in the div.
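
The situation described in this comment can be illustrated with a toy model, assuming only what the comment states: the builder keeps a stack of open elements separate from the tree, and removing an element from the stack does not detach it from its parent. This is not jsoup's HtmlTreeBuilder code; `ToyElement`, `ToyBuilder` and their methods are hypothetical names for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class StackVersusTree {

    /** Toy element: a tag name plus child elements. Not jsoup's Element class. */
    static final class ToyElement {
        final String tag;
        final List<ToyElement> children = new ArrayList<>();

        ToyElement(String tag) {
            this.tag = tag;
        }
    }

    /** Toy builder that tracks open elements on a stack, separately from the tree. */
    static final class ToyBuilder {
        final List<ToyElement> stack = new ArrayList<>();

        /** Creates an element, appends it to the current (top) element, and pushes it. */
        ToyElement open(String tag) {
            ToyElement element = new ToyElement(tag);
            if (!stack.isEmpty()) {
                stack.get(stack.size() - 1).children.add(element);
            }
            stack.add(element);
            return element;
        }

        /** Removes the element from the stack only; the tree is left untouched. */
        void removeFromStack(ToyElement element) {
            stack.remove(element);
        }

        /** Returns the tags on the stack, bottom first, joined with ", ". */
        String stackTags() {
            List<String> tags = new ArrayList<>();
            for (ToyElement element : stack) {
                tags.add(element.tag);
            }
            return String.join(", ", tags);
        }
    }

    public static void main(String[] args) {
        ToyBuilder builder = new ToyBuilder();
        builder.open("html");
        builder.open("body");
        ToyElement div = builder.open("div");
        ToyElement anchor = builder.open("a");
        builder.open("h2");

        System.out.println(builder.stackTags());           // html, body, div, a, h2
        builder.removeFromStack(anchor);
        System.out.println(builder.stackTags());           // html, body, div, h2
        System.out.println(div.children.contains(anchor)); // true
    }
}
```

In this model the anchor disappears from the stack but remains a child of the div, which matches the state reported from the debugger. Whether that leftover link is what causes the reinsertion in jsoup is the commenter's guess, and the sketch does not confirm it.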
