Misinformation of getByXPath method

JavaDoc states: “Evaluates the specified XPath expression from this node, returning the matching elements.”

That implies that, given this HTML

<html>
<body>
    <h1>Wrong</h1>
    <div>
        <h1>Right</h1>
    </div>
</body>
</html>

and with the div node selected, we would select its child h1 node by passing //h1 as the XPath. But that is not the case: we must first select the current node with a dot, so the correct XPath is .//h1. While that is proper XPath, I’d argue that the JavaDoc implies the node is already selected. It is especially confusing if you print the node as XML and try to validate your XPath with third-party tools.

It goes a bit against common sense that a query on a selected node does not traverse from its location. I do not expect a change in the code, but a more specific JavaDoc would definitely be helpful.
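
The difference between //h1 and .//h1 is standard XPath behavior and can be reproduced without HtmlUnit. The following is a minimal sketch using only the JDK's javax.xml.xpath API on a well-formed version of the markup above. The class name XPathContextDemo and the helper firstMatchFromDiv are hypothetical names made up for this illustration. They are not part of HtmlUnit.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: illustrates XPath context-node semantics with the JDK only.
public class XPathContextDemo {

    // Well-formed version of the markup from the report.
    static final String MARKUP =
            "<html><body><h1>Wrong</h1><div><h1>Right</h1></div></body></html>";

    /**
     * Evaluates expr with the div element as the context node and returns
     * the text of the first match in document order (or null if none).
     */
    static String firstMatchFromDiv(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(MARKUP)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Select the div first, then use it as the context node.
            Node div = (Node) xpath.evaluate("/html/body/div", doc, XPathConstants.NODE);
            Node match = (Node) xpath.evaluate(expr, div, XPathConstants.NODE);
            return match == null ? null : match.getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // A leading "/" makes the path absolute: the search starts at the
        // root of the document that contains the context node.
        System.out.println(firstMatchFromDiv("//h1"));   // prints "Wrong"

        // A leading "." makes the path relative to the context node.
        System.out.println(firstMatchFromDiv(".//h1"));  // prints "Right"
    }
}
```

The context node only matters for relative paths. An expression that starts with a slash always resolves from the document root, which matches the behavior reported for getByXPath in this issue.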

Issue Analytics

State:
Created 4 years ago
Comments:5 (3 by maintainers)

Top GitHub Comments

rbri commented, Jul 9, 2019

Please have a look at the commit; I hope the updated documentation is a bit clearer. Many thanks for the report and the discussion of all the details. Enjoy using HtmlUnit.

TomasTokaMrazek commented, Jul 7, 2019

Apologies for the delay, I was on vacation.

@DSantiagoBC

public void getByXPathSelectedNode() throws Exception {
    WebClient client = new WebClient();

    final String htmlContent = "<html>\n"
            + "  <head>\n"
            + "    <title>my title</title>\n"
            + "  </head>\n"
            + "  <body>\n"
            + "    <h1>Heading!</h1>\n"
            + "    <div id='d1'>\n"
            + "      <h1 id='h1'>HtmlUnit</h1>\n"
            + "    </div>\n"
            + "  </body>\n"
            + "</html>";

    StringWebResponse response = new StringWebResponse(htmlContent,
            new URL("http://htmlunit.sourceforge.net//test.html"));

    HtmlPage page = HTMLParser.parseHtml(response, client.getCurrentWindow());

    // Select the div that contains the second h1.
    final HtmlDivision divNode = (HtmlDivision) page.getElementById("d1");

    // Absolute path: searches the whole page, so the first h1 is the one outside the div.
    log.debug("Xpath: {}", divNode.getByXPath("//h1").get(0));
    // Relative path: searches only below the div.
    log.debug("Xpath: {}", divNode.getByXPath(".//h1").get(0));

    client.close();
}

Result:

19:16:12.149 [main] [DEBUG] cz.jaktoviditoka.investmentportfolio.model.HtmlUnitTest - Xpath: HtmlHeading1[<h1>]
19:16:12.149 [main] [DEBUG] cz.jaktoviditoka.investmentportfolio.model.HtmlUnitTest - Xpath: HtmlHeading1[<h1 id="h1">]

@mguillem I don’t see a single use case where you would select a child object from an HTML page and then call an XPath method on it that searches the whole page, including its parents. Why exactly would I select some div node from an HtmlPage and then call getByXPath on this div in order to search the whole HtmlPage? From a Java OOP standpoint, I should call the method on the original HtmlPage object for that. Java is not a command line.

I personally think that the dot in the XPath should be implicit, not explicit. But as I said, that would bring compatibility issues, so I’ll settle for better docs. I literally spent hours trying to figure out why getByXPath on a child node searches the whole tree. But that might be due to my WSO2 EI background, where the dot is basically never used.