Scrape Page with Cloudflare Rocket Loader Running Javascript

The page I’m trying to scrape uses Cloudflare Rocket Loader to execute its JavaScript. One script in particular retrieves the articles in consecutive 15-article chunks, which can be observed in the browser. I’d like to mimic this with HtmlUnit: run the scripts, then capture all 200 articles from the loaded page.

(HtmlUnit version 2.42.0)

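import java.io.IOException;
import java.util.List;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Test {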
    public static void main(String[] args) throws IOException {
        int count = 0;
        try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX_68)) {
            webClient.getCache().clear();
            webClient.getOptions().setUseInsecureSSL(true);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setRedirectEnabled(true);
            webClient.getOptions().setDownloadImages(false);
            webClient.getCookieManager().setCookiesEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(true);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());
            webClient.setCssErrorHandler(new SilentCssErrorHandler());
            final HtmlPage page = webClient.getPage("https://www.mdpi.com/search?sort=pubdate&page_count=200&year_from=1996&year_to=2021&journal=applsci&view=compact");
            webClient.waitForBackgroundJavaScriptStartingBefore(10000);

            //get all 200 articles
            final List<HtmlDivision> divs = page.getByXPath("//div[contains(@class, 'generic-item article-item')]");
            for (HtmlDivision div : divs) {
                System.out.println(count++ + "\t" + div.getTextContent());
            }
            System.out.println("Found " + count + " of an expected 200 articles");
        }
    }
}

Running this yields only 15 of the expected 200 articles, along with warnings such as the following:

May 12, 2021 1:14:42 PM com.gargoylesoftware.htmlunit.html.HtmlScript isExecutionNeeded
WARNING: Script is not JavaScript (type: '4eca6d626edb4bb11e74dead-text/javascript', language: ''). Skipping execution.

So the JavaScript is not run because HtmlUnit does not recognize it as JavaScript. The type attribute (‘4eca6d626edb4bb11e74dead-text/javascript’), which Rocket Loader has prefixed with a token, seems to be what trips up that recognition.
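To make the mismatch concrete, here is a minimal sketch of how such a prefixed type could be mapped back to plain text/javascript. It assumes the prefix is always a run of hex digits followed by a hyphen, as in the warning above; the class and method names are hypothetical, and this is not necessarily how HtmlUnit itself handles the case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlUnit.
public class RocketLoaderTypeFix {

    // Assumption: the rewritten type looks like "<hex token>-text/javascript".
    private static final Pattern PREFIXED_TYPE =
            Pattern.compile("^[0-9a-f]+-(text/javascript)$", Pattern.CASE_INSENSITIVE);

    // Same assumption, applied to type attributes inside raw HTML.
    private static final Pattern TYPE_ATTR =
            Pattern.compile("(type\\s*=\\s*[\"'])[0-9a-f]+-(text/javascript[\"'])",
                    Pattern.CASE_INSENSITIVE);

    // Returns "text/javascript" for a prefixed type, otherwise the input unchanged.
    public static String normalizeType(String type) {
        if (type == null) {
            return null;
        }
        Matcher m = PREFIXED_TYPE.matcher(type.trim());
        return m.matches() ? m.group(1) : type;
    }

    // Strips the hex prefix from every matching type attribute in the given HTML.
    public static String restoreScriptTypes(String html) {
        return TYPE_ATTR.matcher(html).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(normalizeType("4eca6d626edb4bb11e74dead-text/javascript"));
        System.out.println(restoreScriptTypes(
                "<script type=\"4eca6d626edb4bb11e74dead-text/javascript\">load();</script>"));
    }
}
```

One place a rewrite like restoreScriptTypes might be applied is to the response body before HtmlUnit parses it, for example from a WebConnectionWrapper subclass. Whether the page then behaves as it does in a browser is untested here, since the scripts would run directly instead of being scheduled by Rocket Loader.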

One twist: with version 2.26 and BrowserVersion.FIREFOX_45, the same code seems to ignore Rocket Loader, executes the JavaScript as expected, and yields all 200 articles.

Any help or insight would be greatly appreciated.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

rbri commented, May 14, 2021

There is a new snapshot available.

rbri commented, May 14, 2021

Thanks, enjoy using HtmlUnit.
