Scrape Page with Cloudflare Rocket Loader Running Javascript
The page that I’m trying to scrape uses Cloudflare Rocket Loader to execute JavaScript. One script in particular retrieves articles in consecutive 15-article chunks, which can be observed in the browser. I’d like to mimic this with HtmlUnit: run the scripts, then capture all 200 articles from the loaded page.
(Version 2.42.0)
import java.io.IOException;
import java.util.List;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Test {
    public static void main(String[] args) throws IOException {
        int count = 0;
        try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX_68)) {
            webClient.getCache().clear();
            webClient.getOptions().setUseInsecureSSL(true);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setRedirectEnabled(true);
            webClient.getOptions().setDownloadImages(false);
            webClient.getCookieManager().setCookiesEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(true);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());
            webClient.setCssErrorHandler(new SilentCssErrorHandler());

            final HtmlPage page = webClient.getPage("https://www.mdpi.com/search?sort=pubdate&page_count=200&year_from=1996&year_to=2021&journal=applsci&view=compact");
            // Give background scripts (the chunked article loader) up to 10 seconds to start
            webClient.waitForBackgroundJavaScriptStartingBefore(10000);

            // Get all 200 articles
            final List<HtmlDivision> divs = page.getByXPath("//div[contains(@class, 'generic-item article-item')]");
            for (HtmlDivision div : divs) {
                System.out.println(count++ + "\t" + div.getTextContent());
            }
            System.out.println("Found " + count + " of an expected 200 articles");
        }
    }
}
Running this yields only 15 of the desired 200 articles, and errors such as the following:
May 12, 2021 1:14:42 PM com.gargoylesoftware.htmlunit.html.HtmlScript isExecutionNeeded
WARNING: Script is not JavaScript (type: '4eca6d626edb4bb11e74dead-text/javascript', language: ''). Skipping execution.
So the JavaScript is not run because it is not recognized as JavaScript; in particular, the prefixed type attribute (‘4eca6d626edb4bb11e74dead-text/javascript’) seems to be tripping up that recognition.
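To make the type-attribute problem concrete, here is a minimal sketch of how such a prefixed type could be recognized and mapped back to a plain MIME type. It is not HtmlUnit code and not necessarily how the library handles Rocket Loader; the class name RocketLoaderType, both method names, and the assumption that the prefix is a run of at least eight hex characters followed by a hyphen are all hypothetical, inferred only from the single warning above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlUnit.
class RocketLoaderType {

    // ASSUMPTION: a mangled type looks like "<8+ hex chars>-<original type>".
    private static final Pattern MANGLED_TYPE =
            Pattern.compile("^[0-9a-f]{8,}-(.+)$");

    // Matches a <script ... type="<hex>-<type>"> attribute inside raw HTML.
    private static final Pattern SCRIPT_TYPE_ATTR = Pattern.compile(
            "(<script\\b[^>]*?\\btype=[\"'])[0-9a-f]{8,}-([^\"']+)([\"'])",
            Pattern.CASE_INSENSITIVE);

    /** Returns the type with the hex prefix removed, or the input unchanged. */
    static String normalize(String type) {
        if (type == null) {
            return null;
        }
        Matcher m = MANGLED_TYPE.matcher(type.trim());
        return m.matches() ? m.group(1) : type;
    }

    /** Rewrites every prefixed script type attribute in an HTML string. */
    static String stripFromHtml(String html) {
        return SCRIPT_TYPE_ATTR.matcher(html).replaceAll("$1$2$3");
    }

    public static void main(String[] args) {
        // Type value taken from the warning in the log above.
        System.out.println(normalize("4eca6d626edb4bb11e74dead-text/javascript"));
        System.out.println(stripFromHtml(
                "<script type=\"4eca6d626edb4bb11e74dead-text/javascript\" src=\"a.js\"></script>"));
    }
}
```

A rewrite like stripFromHtml would have to be applied to the response body before the page is parsed. Whether that alone reproduces what Rocket Loader does in a real browser (deferred, ordered execution) is not something this sketch establishes.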
One twist is that the same code with Version 2.26 and BrowserVersion.FIREFOX_45 seems to ignore Rocket Loader, executes the Javascript as expected, and yields 200 articles.
Any help or insight would be greatly appreciated.
Top GitHub Comments
There is a new snapshot available.
Thanks, enjoy using HtmlUnit.