Some websites block Puppeteer access and show a blank ad page

See original GitHub issue

Steps to reproduce

Tell us about your environment:

  • Puppeteer version: 5.3.0
  • Platform / OS version: Ubuntu 18.04
  • URLs (if applicable):
  • Node.js version: v12.6.0

What steps will reproduce the problem?


var fs = require("fs");

const puppeteer = require('puppeteer-extra')
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())

const repl = require('puppeteer-extra-plugin-repl')({ addToPuppeteerClass: false })
puppeteer.use(repl)

var sleep = require('sleep');

(async () => {

  const browser = await puppeteer.launch({
    headless: false,
    ignoreHTTPSErrors: true,
    userDataDir: './tmp',
    args: [
        '--lang=en-US,en;q=0.9',
        // '--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3419.0 Safari/537.36"',
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-infobars',
        '--window-position=0,0',
        '--ignore-certificate-errors',
        '--ignore-certificate-errors-spki-list',
    ],
  });


  const page = await browser.newPage();


  const preloadFile = fs.readFileSync('./preload.js', 'utf8');
  await page.evaluateOnNewDocument(preloadFile);



  await page.setViewport({width: 1200, height: 720})

  // The setter is synchronous; 0 disables the default navigation timeout.
  page.setDefaultNavigationTimeout(0);

  // page.goto() already waits for the navigation it starts, so waitUntil is
  // passed to it directly instead of pairing it with waitForNavigation().
  await page.goto('https://www.coingecko.com/en/coins/bitcoin', { waitUntil: 'networkidle0', timeout: 60000 });
  // Note: sleep.sleep() blocks the Node.js event loop for the whole 5 seconds.
  await sleep.sleep(5)

  await page.goto('https://www.coingecko.com/en/coins/ethereum', { waitUntil: 'networkidle0', timeout: 60000 });
  await sleep.sleep(5)


  await repl.repl(page)
  await sleep.sleep(5)

})();

What is the expected result?

Normal behavior: the page loads and stays loaded, with no automatic redirection.

What happens instead?

Some very popular websites (I tried tweetdeck.twitter.com, reddit.com, and coingecko.com) appear to block access from the Puppeteer-controlled browser. I am not sure that is what is actually happening, so please tell me what this is. Instead of the normal page they show a blank page with an ad, as in the example screenshot linked at the end of this section. More precisely, the normal page is displayed for a few seconds at most, and then the browser is automatically redirected to a blank page. Pressing the back button on the mouse manually returns to the normal page, which can then be viewed in that session. While the browser sits on the blank page after a goto, Puppeteer is unable to scrape anything from the document.

https://yuis.xsrv.jp/images/ss/ShareX_ScreenShot_fc9a4c49-6350-4051-84b4-8362070eb95d.png
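
One way to narrow down where the blank page comes from is to log every main-frame navigation and every HTTP redirect while the script runs. The following is a minimal sketch, not part of the original report: `traceNavigations` is a hypothetical helper name, and it relies only on the `framenavigated` and `response` page events. A redirect triggered by page JavaScript would show up as a `navigated` entry with no matching `redirect` entry.

```javascript
// Hypothetical helper (not from the original issue): collects main-frame
// navigations and 3xx responses so the jump to the blank page can be traced.
function traceNavigations(page, log = []) {
  page.on('framenavigated', (frame) => {
    // parentFrame() is null only for the main frame; ignore iframes (ads etc.).
    if (frame.parentFrame() === null) {
      log.push({ type: 'navigated', url: frame.url() });
    }
  });

  page.on('response', (response) => {
    const status = response.status();
    if (status >= 300 && status < 400) {
      log.push({
        type: 'redirect',
        status,
        from: response.url(),
        to: response.headers()['location'],
      });
    }
  });

  return log;
}

// Usage inside the script above, right after browser.newPage():
//   const navLog = traceNavigations(page);
//   ... page.goto(...) and the sleeps ...
//   console.log(navLog);
```

If the log shows the blank URL arriving through a 3xx response, the server is redirecting; if it only appears as a navigation, a script on the page is the more likely cause.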

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 8

Top GitHub Comments

maheshjukanti commented, Oct 16, 2020 (1 reaction)

You are using puppeteer-extra and its plugins, but you have posted this in the puppeteer repository.

Maybe the stealth plugin itself is being blocked. Try your script without the stealth plugin enabled.

OrKoN commented, Sep 5, 2022 (0 reactions)

If the sites choose to block Puppeteer, there is not much we can do. As others suggested, try using the stealth plugin; perhaps it helps.
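
Both comments come down to comparing a run with the stealth plugin against a run without it. A minimal sketch of that comparison follows. `finalUrlAfter` is a hypothetical helper, the 5-second settle time is an assumption taken from the delays in the original script, and the launcher is passed in so the same function works with either `puppeteer` or `puppeteer-extra`.

```javascript
// Hypothetical helper: opens `url` with the given launcher, waits a few
// seconds for any late redirect, and reports where the page ended up.
async function finalUrlAfter(launcher, url, settleMs = 5000) {
  const browser = await launcher.launch({ headless: false });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 60000 });
    // Non-blocking delay, so Puppeteer keeps processing browser events.
    await new Promise((resolve) => setTimeout(resolve, settleMs));
    return page.url();
  } finally {
    await browser.close();
  }
}

// Usage (assumes puppeteer, puppeteer-extra and the stealth plugin are installed):
//   const vanilla = require('puppeteer');
//   const extra = require('puppeteer-extra');
//   extra.use(require('puppeteer-extra-plugin-stealth')());
//   const target = 'https://www.coingecko.com/en/coins/bitcoin';
//   console.log('without stealth:', await finalUrlAfter(vanilla, target));
//   console.log('with stealth:   ', await finalUrlAfter(extra, target));
```

If both runs end on the blank page, the stealth plugin is probably not the deciding factor; if only one does, that points to which configuration the site reacts to.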
