CPU usage borders on full capacity with a simple scraping process
Tell us about your environment:
- Puppeteer version: 1.17.0
- Platform / OS version: CentOS Linux 7 (Core)
- Node.js version: 12.3.1
What steps will reproduce the problem?
I use Puppeteer to set up a fairly simple scraper whose task is to scrape metadata (OpenGraph or Schema.org properties) from about 500,000 URLs.
I process the URLs in parallel, in up to 10 sets of 200 URLs each, spread over up to 10 `browserContexts` with one `Page` each in a single Chromium instance. I create a `browserContext` for each set, scrape the URLs, and then close the `browserContext`. This procedure is repeated until all URLs have been processed.
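Roughly, the batching looks like this (the `chunk` helper and `allUrls` are simplified placeholders, not my actual implementation):

```js
// Illustrative only: split the full URL list into sets of a fixed size.
function chunk (urls, size) {
  const sets = []
  for (let i = 0; i < urls.length; i += size) {
    sets.push(urls.slice(i, i + size))
  }
  return sets
}

// 500,000 URLs -> 2,500 sets of 200 URLs each
const sets = chunk(allUrls, 200)
```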
The creation process of one `browserContext` looks like this (I manage all contexts in one instance variable, so I can easily reuse them when needed):
```js
const contextName = 'browserContextId'
const context = await browser.createIncognitoBrowserContext()
const pageName = 'pageId'
const page = await context.newPage()
await page.emulate({
  name: 'default',
  userAgent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/0;) Safari/537.36`,
  viewport: {
    width: 1366,
    height: 768,
    deviceScaleFactor: 1,
    isMobile: false,
    hasTouch: false,
    isLandscape: true
  }
})
// keep a reference to the context and its page for the scraping step
this.contexts[contextName] = context
this.contexts[contextName].pages[pageName] = page
```
The scraping process for each URL looks like this:
```js
const page = this.contexts[contextName].pages[pageName]
await page.goto(url)
const meta = await page.evaluate(() => {
  let date, description
  // prefer Schema.org JSON-LD if present
  let schemaOrg = document.querySelector('script[type="application/ld+json"]')
  if (schemaOrg) {
    schemaOrg = JSON.parse(schemaOrg.innerHTML)
    date = schemaOrg.datePublished || schemaOrg.dateModified || schemaOrg.dateCreated
    description = schemaOrg.description
  }
  // fall back to meta tags
  if (!date) {
    date = document.querySelector('meta[property="article:published_date"]')
    date = date ? date.getAttribute('content') : null
  }
  if (!description) {
    description = document.querySelector('meta[property="og:description"]')
    description = description ? description.getAttribute('content') : null
  }
  return {
    ...(date && { date }),
    ...(description && { description })
  }
})
```
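One caveat with this approach: `JSON.parse` throws on malformed markup, and `ld+json` scripts often contain arrays of objects. A more defensive version of the extraction (a simplified sketch, not my exact code) would be:

```js
// Inside page.evaluate: tolerate malformed JSON-LD and the common case
// where the script element contains an array of objects.
let schemaOrg = null
const node = document.querySelector('script[type="application/ld+json"]')
if (node) {
  try {
    const parsed = JSON.parse(node.innerHTML)
    schemaOrg = Array.isArray(parsed) ? parsed[0] : parsed
  } catch (e) {
    schemaOrg = null // malformed markup: fall back to the meta tags
  }
}
```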
After the URL set has been processed, I close the `browserContext` and start again from the beginning:
```js
await this.contexts[contextName].close()
delete this.contexts[contextName]
```
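Put together, the whole run looks roughly like this (`createContext`, `scrape`, and `closeContext` stand in for the steps shown above):

```js
// Process sets up to 10 at a time: each slot creates a context, scrapes
// its set of 200 URLs sequentially, then closes the context again.
async function run (browser, sets) {
  for (let i = 0; i < sets.length; i += 10) {
    const batch = sets.slice(i, i + 10)
    await Promise.all(batch.map(async (urls, j) => {
      const contextName = `context-${i + j}`
      await createContext(browser, contextName)   // creation step above
      for (const url of urls) {
        await scrape(browser, contextName, url)   // scraping step above
      }
      await closeContext(browser, contextName)    // cleanup step above
    }))
  }
}
```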
What is the expected result?
Resource-efficient, performant scraping.
At this point I no longer know how to optimize my code, especially since I have already tried to:
- split the scraping between several instances of Chromium,
- split the scraping across several `Pages` in one Chromium instance,
- split the scraping across several `browserContexts` in one Chromium instance.
What happens instead?
I already reach full CPU usage after a few hundred URLs (on an 8-core system).
Please note: the above code does not represent my implementation 1:1. I have deliberately tried to keep it as simple and comprehensible as possible, so please bear with any inconsistencies or minor errors.
Top GitHub Comments
Using the launch options from this post has made a HUUGE difference for me! @aslushnikov why are these launch options never mentioned here as something to try, when people in the past (e.g. here and here and here) have observed that CPU usage appears to be higher than expected? Are there any other launch options that would be useful to add to minimise CPU usage?
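For reference, launch options along these lines are what such threads typically suggest (a sketch; the exact set from the linked post isn't reproduced here, and which flags actually help will depend on the workload):

```js
const puppeteer = require('puppeteer')

// Illustrative launch arguments aimed at reducing background CPU work.
const browser = await puppeteer.launch({
  args: [
    '--disable-gpu',                             // no GPU work needed for scraping
    '--disable-dev-shm-usage',                   // avoid a too-small /dev/shm on Linux
    '--disable-background-timer-throttling',
    '--disable-backgrounding-occluded-windows',
    '--disable-renderer-backgrounding'
  ]
})
```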
We are closing this issue. If the issue still persists in the latest version of Puppeteer, please reopen the issue and update the description. We will try our best to accommodate it!