question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Many errors with broad crawl

See original GitHub issue

Hello,

I’m using scrapy-playwright package to capture screenshot and get html content of 2000 websites, my main code looks simple:

def start_requests(self):
....
       yield scrapy.Request(
             url=url,
             meta={"playwright": True, "playwright_include_page": True},
       )
....
async def parse(self, response):
       page = response.meta["playwright_page"]
       ...
       await page.screenshot(path=screenshot_file_full_path)
       html = await page.content()
       await page.close()
       ...

There are many errors when I ran the script, I change CONCURRENT_REQUESTS from 30 to 1 but the results was no different.

My test included 2000 websites, but the Scrapy script scraped only 511 results (about 25% successful rate) and the script is running without more results and error logs.

Please guide me to fix this, thanks in advance,

My error logs:

2021-06-22 11:49:53 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.ask.com>
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 824, in adapt
    extracted = result.result()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy_playwright/handler.py", line 140, in _download_request
    result = await self._download_request_with_page(request, spider, page)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy_playwright/handler.py", line 160, in _download_request_with_page
    response = await page.goto(request.url)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 6006, in goto
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_page.py", line 429, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_frame.py", line 117, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
navigating to "https://www.ask.com", waiting until "load"
============================================================
Note: use DEBUG=pw:api environment variable to capture Playwright logs.
2021-06-22 11:51:24 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.tvzavr.ru>
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 824, in adapt
    extracted = result.result()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy_playwright/handler.py", line 140, in _download_request
    result = await self._download_request_with_page(request, spider, page)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy_playwright/handler.py", line 160, in _download_request_with_page
    response = await page.goto(request.url)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 6006, in goto
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_page.py", line 429, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_frame.py", line 117, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
navigating to "https://www.tvzavr.ru", waiting until "load"
============================================================
Note: use DEBUG=pw:api environment variable to capture Playwright logs.
.....
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11634' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
    await self._channel.send("continue", cast(Any, overrides))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11640' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
    await self._channel.send("continue", cast(Any, overrides))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11641' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
    await self._channel.send("continue", cast(Any, overrides))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11652' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
    await self._channel.send("continue", cast(Any, overrides))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11657' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
    await self._channel.send("continue", cast(Any, overrides))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11669' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
    await self._channel.send("continue", cast(Any, overrides))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11670' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
    await self._channel.send("continue", cast(Any, overrides))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11671' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
    await self._channel.send("continue", cast(Any, overrides))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
.....

My Scrapy settings looks like:

CONCURRENT_REQUESTS = 30
...
# Playwright settings
PLAYWRIGHT_CONTEXT_ARGS = {'ignore_https_errors':True}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30000
...
RETRY_ENABLED = True
RETRY_TIMES = 3

My env:

Ubuntu 20.04 and MacOS 11.2.3
Python 3.8.5
Scrapy 2.5.0
playwright 1.12.1
scrapy-playwright 0.0.3

Issue Analytics

  • State:closed
  • Created 2 years ago
  • Comments:10 (2 by maintainers)

github_iconTop GitHub Comments

1reaction
elacuestacommented, Mar 26, 2022

It’s been a while, but I think I understand what’s happening now: #74.

0reactions
phongtnitcommented, Jul 14, 2021

Hi @Obeyed thanks for share your test. I also tested on multiple-contexts branch to create new context per domain on about 2500 urls and the error Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory is absolutely a bug.

Read more comments on GitHub >

github_iconTop Results From Across the Web

Broad Crawls — Scrapy 2.7.1 documentation
These are some common properties often found in broad crawls: they crawl many domains (often, unbounded) instead of a specific set of sites....
Read more >
Performing a Scrapy broad crawl with high concurrency and ...
I am trying to make a Scrapy broad crawl. The goal is to have many concurrent crawls at different domains but at the...
Read more >
What Are Crawl Errors? Why Do Crawl Errors Matter?
Crawl errors are issues that crawlers encounter while trying to access your pages. They can be URL-specific or cause your entire website to...
Read more >
A Practical Guide To Web Data QA Part V: Broad Crawls - Zyte
Unexpected problems are common for broad crawls and one of the ways to deal with them is by using deductive reasoning and analysis....
Read more >
4 Common Crawl Errors and Why You Need to Fix Them
If the errors are serious enough, that could lead to a whole section of your pages not getting crawled and some pages getting...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found