CPU usage borders on full capacity with a simple scraping process
Tell us about your environment:
- Puppeteer version: 1.17.0
- Platform / OS version: CentOS Linux 7 (Core)
- Node.js version: 12.3.1
What steps will reproduce the problem?
I use Puppeteer to set up a fairly simple scraper whose task is to scrape metadata (OpenGraph or Schema.org properties) from about 500,000 URLs.
I process the URLs in parallel, in up to 10 sets of 200 URLs each, spread over up to 10 `browserContexts` with one `Page` each in a single Chromium instance. I create a `browserContext` for each set, scrape the URLs, and then close the `browserContext`. This procedure is repeated until all URLs have been processed.
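Roughly, the batching looks like this (the `chunk` helper and `allUrls` are simplified placeholders, not my actual implementation):

```js
// Illustrative only: split the full URL list into sets of a fixed size.
function chunk (urls, size) {
  const sets = []
  for (let i = 0; i < urls.length; i += size) {
    sets.push(urls.slice(i, i + size))
  }
  return sets
}

// 500,000 URLs -> 2,500 sets of 200 URLs each
const sets = chunk(allUrls, 200)
```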
The creation process of one `browserContext` looks like this (I manage all contexts in one instance variable, so I can easily reuse them when needed):
```js
const contextName = 'browserContextId'
const context = await browser.createIncognitoBrowserContext()
const pageName = 'pageId'
const page = await context.newPage()
await page.emulate({
  name: 'default',
  userAgent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/0;) Safari/537.36`,
  viewport: {
    width: 1366,
    height: 768,
    deviceScaleFactor: 1,
    isMobile: false,
    hasTouch: false,
    isLandscape: true
  }
})
// keep a reference to the context and its page for the scraping step
this.contexts[contextName] = context
this.contexts[contextName].pages[pageName] = page
```
The scraping process for each URL looks like this:
```js
const page = this.contexts[contextName].pages[pageName]
await page.goto(url)
const meta = await page.evaluate(() => {
  let date, description
  // prefer Schema.org JSON-LD if present
  let schemaOrg = document.querySelector('script[type="application/ld+json"]')
  if (schemaOrg) {
    schemaOrg = JSON.parse(schemaOrg.innerHTML)
    date = schemaOrg.datePublished || schemaOrg.dateModified || schemaOrg.dateCreated
    description = schemaOrg.description
  }
  // fall back to meta tags
  if (!date) {
    date = document.querySelector('meta[property="article:published_date"]')
    date = date ? date.getAttribute('content') : null
  }
  if (!description) {
    description = document.querySelector('meta[property="og:description"]')
    description = description ? description.getAttribute('content') : null
  }
  return {
    ...(date && { date }),
    ...(description && { description })
  }
})
```
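One caveat with this approach: `JSON.parse` throws on malformed markup, and `ld+json` scripts often contain arrays of objects. A more defensive version of the extraction (a simplified sketch, not my exact code) would be:

```js
// Inside page.evaluate: tolerate malformed JSON-LD and the common case
// where the script element contains an array of objects.
let schemaOrg = null
const node = document.querySelector('script[type="application/ld+json"]')
if (node) {
  try {
    const parsed = JSON.parse(node.innerHTML)
    schemaOrg = Array.isArray(parsed) ? parsed[0] : parsed
  } catch (e) {
    schemaOrg = null // malformed markup: fall back to the meta tags
  }
}
```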
After the URL set has been processed, I close the `browserContext` and start again from the beginning:
```js
await this.contexts[contextName].close()
delete this.contexts[contextName]
```
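Put together, the whole run looks roughly like this (`createContext`, `scrape`, and `closeContext` stand in for the steps shown above):

```js
// Process sets up to 10 at a time: each slot creates a context, scrapes
// its set of 200 URLs sequentially, then closes the context again.
async function run (browser, sets) {
  for (let i = 0; i < sets.length; i += 10) {
    const batch = sets.slice(i, i + 10)
    await Promise.all(batch.map(async (urls, j) => {
      const contextName = `context-${i + j}`
      await createContext(browser, contextName)   // creation step above
      for (const url of urls) {
        await scrape(browser, contextName, url)   // scraping step above
      }
      await closeContext(browser, contextName)    // cleanup step above
    }))
  }
}
```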
What is the expected result?
Resource-efficient, performant scraping.
At this point I no longer know how to optimize my code, especially since I have already tried to:
- split the scraping between several instances of Chromium,
- split the scraping across several `Pages` in one Chromium instance,
- split the scraping across several `browserContexts` in one Chromium instance.
What happens instead?
I already reach full CPU usage after a few hundred URLs (on an 8-core system).
Please note: the above code does not represent my implementation 1:1. I have deliberately tried to keep it as simple and comprehensible as possible, so please bear with any inconsistencies or minor errors.
Top GitHub Comments
Using the launch options from this post has made a HUUGE difference for me! @aslushnikov why are these launch options never mentioned here as something to try, when people in the past (e.g. here and here and here) have observed that CPU usage appears to be higher than expected? Are there any other launch options that would be useful to add to minimise CPU usage?
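For reference, launch options along these lines are what such threads typically suggest (a sketch; the exact set from the linked post isn't reproduced here, and which flags actually help will depend on the workload):

```js
const puppeteer = require('puppeteer')

// Illustrative launch arguments aimed at reducing background CPU work.
const browser = await puppeteer.launch({
  args: [
    '--disable-gpu',                             // no GPU work needed for scraping
    '--disable-dev-shm-usage',                   // avoid a too-small /dev/shm on Linux
    '--disable-background-timer-throttling',
    '--disable-backgrounding-occluded-windows',
    '--disable-renderer-backgrounding'
  ]
})
```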
We are closing this issue. If the issue still persists in the latest version of Puppeteer, please reopen the issue and update the description. We will try our best to accommodate it!