Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

CPU borders on full capacity with a simple scraping process

See original GitHub issue

Tell us about your environment:

  • Puppeteer version: 1.17.0
  • Platform / OS version: CentOS Linux 7 (Core)
  • Node.js version: 12.3.1

What steps will reproduce the problem?

I use Puppeteer to set up a quite simple scraper which has the task to scrape meta data (OpenGraph properties or Schema.org properties) of about 500,000 URLs. I process the URLs in parallel in up to 10 sets of 200 URLs each spread over up to 10 browserContexts with one Page each in a single Chromium instance.

I create a browserContext for each set, scrape the URLs and then close the browserContext. This procedure is repeated until all URLs have been processed.
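The batching procedure above can be sketched as a small dependency-free concurrency pool. This is a hypothetical helper, not the reporter's actual implementation: `mapWithConcurrency` runs `worker` over `items` with at most `limit` calls in flight, which is the same "up to 10 sets at a time" shape described above.

```javascript
// Minimal concurrency pool (no dependencies). Runs `worker` over `items`
// with at most `limit` invocations in flight at once; preserves order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length)
  let next = 0

  async function run() {
    // Each runner pulls the next unclaimed index until the list is drained.
    while (next < items.length) {
      const i = next++
      results[i] = await worker(items[i], i)
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, run)
  await Promise.all(runners)
  return results
}
```

It could then be used as `await mapWithConcurrency(urlSet, 10, url => scrapeUrl(context, url))`, where `scrapeUrl` is a hypothetical wrapper around the per-URL scraping code shown below.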

The creation process of one browserContext looks like this: (I manage all contexts in one instance variable, so I can easily reuse them when needed)

const contextName = 'browserContextId'
const context = await browser.createIncognitoBrowserContext()

const pageName = 'pageId'
const page = await context.newPage()
await page.emulate({
	name: 'default',
	userAgent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/0;) Safari/537.36`,
	viewport: {
		width: 1366,
		height: 768,
		deviceScaleFactor: 1,
		isMobile: false,
		hasTouch: false,
		isLandscape: true
	}
})

this.contexts[contextName] = context
this.contexts[contextName].pages[pageName] = page

The scraping process for each URL looks like this:

const page = this.contexts[contextName].pages[pageName]
await page.goto(url)


await page.evaluate(() => {
    let date, description

    let schemaOrg = document.querySelector('script[type="application/ld+json"]')

    if (schemaOrg) {
        // innerHTML is already a string; parse it directly to get the JSON-LD object
        schemaOrg = JSON.parse(schemaOrg.innerHTML)
        date = schemaOrg.datePublished || schemaOrg.dateModified || schemaOrg.dateCreated
        description = schemaOrg.description
    }

    if (!date) {
        date = document.querySelector('meta[property="article:published_date"]')
        date = date ? date.getAttribute('content') : null
    }

    if (!description) {
        description = document.querySelector('meta[property="og:description"]')
        description = description ? description.getAttribute('content') : null
    }

    return {
        ...(date && { date }),
        ...(description && { description })
    }
})

After the URL set has been processed, I close the browserContext and start again from the beginning:

await this.contexts[contextName].close()
delete this.contexts[contextName]

What is the expected result?

Resource-efficient, performant scraping.

At this point, I no longer know how to optimize my code further, especially since I have already tried to:

  • split the scraping between several instances of Chromium,
  • split the scraping on several Pages in one Chromium instance,
  • and split the scraping between several browserContexts in one Chromium instance.

What happens instead?

I already reach full CPU usage after a few hundred URLs (on an 8-core system).

Please note: The above code does not represent my implementation 1:1. I have specifically tried to make my implementation as simple and comprehensible as possible. I therefore ask for your indulgence for any inconsistencies or minor errors.

Issue Analytics

  • State:closed
  • Created 4 years ago
  • Comments:8 (2 by maintainers)

Top GitHub Comments

6 reactions · drmrbrewer commented, Mar 6, 2020

Using the launch options from this post has made a HUUGE difference for me! @aslushnikov why are these launch options never mentioned on here as something to try, when people in the past (e.g. here and here and here) have observed that CPU usage appears to be higher than expected? Are there any other launch options that would be useful to add, to minimise CPU usage?
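The post linked in the comment above is not reproduced here, so the exact flags it recommends remain unknown. As a rough starting point, the launch flags commonly suggested for headless scraping workloads look something like the following sketch; which ones actually help, and which are safe in your environment, depends on the workload:

```javascript
// Commonly suggested launch options for headless scraping workloads.
// Assumption: this is NOT necessarily the set from the post referenced above.
const puppeteerLaunchOptions = {
  headless: true,
  args: [
    '--disable-gpu',           // no GPU on a typical headless server anyway
    '--disable-dev-shm-usage', // avoid /dev/shm exhaustion in small containers
    '--disable-extensions',    // skip extension machinery entirely
    '--no-sandbox',            // only if the environment requires it
    '--disable-setuid-sandbox',
  ],
}

// Usage: const browser = await puppeteer.launch(puppeteerLaunchOptions)
```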

0 reactions · stale[bot] commented, Jul 23, 2022

We are closing this issue. If the issue still persists in the latest version of Puppeteer, please reopen the issue and update the description. We will try our best to accommodate it!


