Jsoup parse fails to descend block elements inside an anchor element correctly
_With the input DOM given below and a simple traversal test, Jsoup yields an unexpected DOM walk_
<p>
<a>beyond good and evil
<div></div>
</a>
thus spoke zarathustra
</p>
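// Imports the snippet relies on; slf4j and JUnit 4 are assumed from the calls used below.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Logs every node on entry (head) and exit (tail) together with its depth.
static class LoggingVisitor implements NodeVisitor {
    private final Logger logger = LoggerFactory.getLogger("Tree depth logger");

    @Override
    public void head(Node node, int depth) {
        logger.info("head node " + node.nodeName() + " with depth " + depth);
    }

    @Override
    public void tail(Node node, int depth) {
        logger.info("tail node " + node.nodeName() + " with depth " + depth);
    }
}

@Test
public void testJsoupTraversalWithBlockElementInsideAnchor() {
    // Same markup as the input DOM above, without the line breaks.
    String body =
        "<p>" +
        "<a>beyond good and evil" +
        "<div></div>" +
        "</a>" +
        "thus spoke zarathustra" +
        "</p>";
    Document doc = Jsoup.parseBodyFragment(body);
    LoggingVisitor visitor = new LoggingVisitor();
    // Instance-style NodeTraversor API, as in the original report; later jsoup
    // releases expose a static NodeTraversor.traverse(visitor, root) instead.
    NodeTraversor traversor = new NodeTraversor(visitor);
    traversor.traverse(doc.body());
}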
_Printing the Jsoup-parsed document shows rogue HTML; the traversal log follows it_
<html>
<head></head>
<body>
<p><a>beyond good and evil</a></p>
<div></div>thus spoke zarathustra
<p></p>
</body>
</html>
INFO [11:01:28.070] [main] Tree depth logger - head node body with depth 0
INFO [11:01:28.073] [main] Tree depth logger - head node p with depth 1
INFO [11:01:28.073] [main] Tree depth logger - head node a with depth 2
INFO [11:01:28.073] [main] Tree depth logger - head node #text with depth 3
INFO [11:01:28.073] [main] Tree depth logger - tail node #text with depth 3
INFO [11:01:28.073] [main] Tree depth logger - tail node a with depth 2
INFO [11:01:28.074] [main] Tree depth logger - tail node p with depth 1
INFO [11:01:28.074] [main] Tree depth logger - head node div with depth 1
INFO [11:01:28.074] [main] Tree depth logger - tail node div with depth 1
INFO [11:01:28.074] [main] Tree depth logger - head node #text with depth 1
INFO [11:01:28.074] [main] Tree depth logger - tail node #text with depth 1
INFO [11:01:28.074] [main] Tree depth logger - head node p with depth 1
INFO [11:01:28.074] [main] Tree depth logger - tail node p with depth 1
INFO [11:01:28.074] [main] Tree depth logger - tail node body with depth 0
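To check that the log above agrees with the printed document, here is a minimal sketch that rebuilds the printed tree by hand and walks it depth-first. It does not use Jsoup: `DepthWalkSketch`, `N`, `walk`, and `walkParsedBody` are hypothetical names made up for this illustration, and the recursion is only a stand-in for what `NodeTraversor` does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DepthWalkSketch {
    /** Hypothetical stand-in for a DOM node: a name plus ordered children. */
    static final class N {
        final String name;
        final List<N> children;

        N(String name, N... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Depth-first walk that records a head line on entry and a tail line on exit. */
    static void walk(N node, int depth, List<String> log) {
        log.add("head node " + node.name + " with depth " + depth);
        for (N child : node.children) {
            walk(child, depth + 1, log);
        }
        log.add("tail node " + node.name + " with depth " + depth);
    }

    /** Hand-built copy of the body as Jsoup printed it in the report. */
    static List<String> walkParsedBody() {
        N body = new N("body",
            new N("p", new N("a", new N("#text"))),  // <p><a>beyond good and evil</a></p>
            new N("div"),                            // <div></div>
            new N("#text"),                          // thus spoke zarathustra
            new N("p"));                             // <p></p>
        List<String> log = new ArrayList<>();
        walk(body, 0, log);
        return log;
    }

    public static void main(String[] args) {
        walkParsedBody().forEach(System.out::println);
    }
}
```

The walk produces the same head/tail sequence and depths as the log, which suggests the traversal itself is consistent and the surprise lies in how the markup was parsed.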
Issue Analytics
- State:
- Created 7 years ago
- Comments:6 (2 by maintainers)
@worldsense-tms: This is non-conforming HTML which will always parse in unexpected (well, unless you read the specification) ways. However, ending up with nested anchor elements in the DOM tree is a bug. I opened issue #845 for that.
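For readers who have not read the specification, two of its tree-construction rules seem enough to explain the output in the report: a `<div>` start tag closes any open `<p>`, and a `</p>` end tag with no open `<p>` first inserts an empty one. Below is a rough sketch of just those two rules, not Jsoup's `HtmlTreeBuilder`. `ParagraphScopeSketch` and its helpers are hypothetical names, "in button scope" is simplified to "anywhere on the stack", and an end tag with no matching open element is simply ignored (the specification handles `</a>` through the adoption agency algorithm, which this sketch does not model).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class ParagraphScopeSketch {
    /** Hypothetical element: a tag name plus ordered children (El or String text). */
    static final class El {
        final String name;
        final List<Object> children = new ArrayList<>();

        El(String name) { this.name = name; }

        String html() {
            StringBuilder sb = new StringBuilder("<" + name + ">");
            for (Object c : children) {
                sb.append(c instanceof El ? ((El) c).html() : c.toString());
            }
            return sb.append("</").append(name).append(">").toString();
        }
    }

    /** Simplified scope check: is an element with this name anywhere on the stack? */
    static boolean isOpen(Deque<El> stack, String name) {
        for (El e : stack) {
            if (e.name.equals(name)) return true;
        }
        return false;
    }

    /** Pops elements until one with the given name has been popped. */
    static void popThrough(Deque<El> stack, String name) {
        while (!stack.pop().name.equals(name)) {
            // keep popping
        }
    }

    /** Inserts a new element under the current node and makes it the current node. */
    static void open(Deque<El> stack, String name) {
        El el = new El(name);
        stack.peek().children.add(el);
        stack.push(el);
    }

    /** Tokens are "<name>", "</name>", or plain text. */
    static String parse(String... tokens) {
        El body = new El("body");
        Deque<El> stack = new ArrayDeque<>();
        stack.push(body);
        for (String t : tokens) {
            if (t.startsWith("</")) {
                String name = t.substring(2, t.length() - 1);
                // Rule 2: </p> with no open <p> inserts an empty <p> first.
                if (name.equals("p") && !isOpen(stack, "p")) open(stack, "p");
                // End tags with no matching open element are ignored here.
                if (isOpen(stack, name)) popThrough(stack, name);
            } else if (t.startsWith("<")) {
                String name = t.substring(1, t.length() - 1);
                // Rule 1: a <div> start tag closes an open <p> (and whatever is inside it).
                if (name.equals("div") && isOpen(stack, "p")) popThrough(stack, "p");
                open(stack, name);
            } else {
                stack.peek().children.add(t);
            }
        }
        return body.html();
    }

    public static void main(String[] args) {
        System.out.println(parse("<p>", "<a>", "beyond good and evil", "<div>", "</div>",
            "</a>", "thus spoke zarathustra", "</p>"));
    }
}
```

Fed the tokens from the report, the sketch ends up with the same shape Jsoup printed: the `<p>` closes before the `<div>`, the text lands directly in the body, and the trailing `</p>` leaves an empty paragraph behind.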
I did some debugging. At the time we encounter the child anchor, the stack inside HtmlTreeBuilder looks like this:
Then in HtmlTreeBuilderState line 284 we remove the anchor from the stack:
But the anchor is still a child of the div. Here’s a screenshot of the debug panel in IntelliJ:
I would bet that the element gets “reinserted” because it’s still in the div.
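The distinction this guess rests on can be shown without any Jsoup code: taking an element off the open-elements stack does not detach it from its parent in the tree. A tiny sketch, where `StackVsTreeSketch`, `El`, and `detachedFromStackOnly` are hypothetical stand-ins and not Jsoup's classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StackVsTreeSketch {
    /** Hypothetical stand-in for a DOM element; not Jsoup's Element class. */
    static final class El {
        final String name;
        final List<El> children = new ArrayList<>();

        El(String name) { this.name = name; }
    }

    /** Returns true when the anchor is off the stack but still a child of the div. */
    static boolean detachedFromStackOnly() {
        El div = new El("div");
        El a = new El("a");
        div.children.add(a);                       // tree: <div><a></a></div>

        // The parser's bookkeeping of open elements is a separate structure.
        List<El> openElements = new ArrayList<>(Arrays.asList(div, a));
        openElements.remove(a);                    // only the bookkeeping changes

        return !openElements.contains(a) && div.children.contains(a);
    }

    public static void main(String[] args) {
        System.out.println(detachedFromStackOnly());
    }
}
```

Whether this is what actually happens inside `HtmlTreeBuilder` is not confirmed here; the sketch only shows why the two structures can disagree.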