Many errors with broad crawl
Hello,
I'm using the scrapy-playwright package to capture a screenshot and get the HTML content of 2000 websites. My main code is simple:
def start_requests(self):
    ....
    yield scrapy.Request(
        url=url,
        meta={"playwright": True, "playwright_include_page": True},
    )
    ....

async def parse(self, response):
    page = response.meta["playwright_page"]
    ...
    await page.screenshot(path=screenshot_file_full_path)
    html = await page.content()
    await page.close()
    ...
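A minimal sketch of one way to make sure the page is closed even when the screenshot or content call raises. Whether unclosed pages contribute to the stall described here is an assumption, not something confirmed in this thread. `FakePage`, `capture` and `run_demo` are hypothetical names, and `FakePage` only stands in for a Playwright page so the example runs without a browser.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for a Playwright page (not the real API object)."""

    def __init__(self, fail_on_screenshot=False):
        self.fail_on_screenshot = fail_on_screenshot
        self.closed = False

    async def screenshot(self, path):
        if self.fail_on_screenshot:
            raise RuntimeError("screenshot failed")

    async def content(self):
        return "<html></html>"

    async def close(self):
        self.closed = True


async def capture(page, screenshot_path):
    """Take a screenshot and return the HTML, closing the page in every case."""
    try:
        await page.screenshot(path=screenshot_path)
        return await page.content()
    finally:
        # Runs on success and on failure, so the page is never left open.
        await page.close()


def run_demo():
    ok_page = FakePage()
    bad_page = FakePage(fail_on_screenshot=True)
    html = asyncio.run(capture(ok_page, "ok.png"))
    try:
        asyncio.run(capture(bad_page, "bad.png"))
        failed = False
    except RuntimeError:
        failed = True
    return html, ok_page.closed, bad_page.closed, failed
```

In the spider, the same `try`/`finally` shape would wrap the body of `parse`, with the real `response.meta["playwright_page"]` in place of `FakePage`.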
There were many errors when I ran the script. I changed CONCURRENT_REQUESTS from 30 to 1, but the results were no different.
My test included 2000 websites, but the Scrapy script scraped only 511 results (about a 25% success rate), and the script keeps running without producing any more results or error logs.
Please help me fix this. Thanks in advance.
My error logs:
2021-06-22 11:49:53 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.ask.com>
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 824, in adapt
extracted = result.result()
File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy_playwright/handler.py", line 140, in _download_request
result = await self._download_request_with_page(request, spider, page)
File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy_playwright/handler.py", line 160, in _download_request_with_page
response = await page.goto(request.url)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 6006, in goto
await self._async(
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_page.py", line 429, in goto
return await self._main_frame.goto(**locals_to_params(locals()))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_frame.py", line 117, in goto
await self._channel.send("goto", locals_to_params(locals()))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
navigating to "https://www.ask.com", waiting until "load"
============================================================
Note: use DEBUG=pw:api environment variable to capture Playwright logs.
2021-06-22 11:51:24 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.tvzavr.ru>
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 824, in adapt
extracted = result.result()
File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy_playwright/handler.py", line 140, in _download_request
result = await self._download_request_with_page(request, spider, page)
File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy_playwright/handler.py", line 160, in _download_request_with_page
response = await page.goto(request.url)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 6006, in goto
await self._async(
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_page.py", line 429, in goto
return await self._main_frame.goto(**locals_to_params(locals()))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_frame.py", line 117, in goto
await self._channel.send("goto", locals_to_params(locals()))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
navigating to "https://www.tvzavr.ru", waiting until "load"
============================================================
Note: use DEBUG=pw:api environment variable to capture Playwright logs.
.....
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11634' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
await self._async(
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
await self._channel.send("continue", cast(Any, overrides))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11640' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
await self._async(
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
await self._channel.send("continue", cast(Any, overrides))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11641' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
await self._async(
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
await self._channel.send("continue", cast(Any, overrides))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11652' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
await self._async(
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
await self._channel.send("continue", cast(Any, overrides))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11657' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
await self._async(
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
await self._channel.send("continue", cast(Any, overrides))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11669' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
await self._async(
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
await self._channel.send("continue", cast(Any, overrides))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11670' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
await self._async(
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
await self._channel.send("continue", cast(Any, overrides))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11671' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
await self._async(
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
await self._channel.send("continue", cast(Any, overrides))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
.....
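To see which failure dominates in a log like the one above, the final line of each traceback can be counted by exception type and message. This is a rough sketch, not part of Scrapy or scrapy-playwright: `tally_exceptions` and `EXCEPTION_LINE` are hypothetical names, and the pattern assumes each traceback ends with a line of the form `package.SomeError: message`.

```python
import re
from collections import Counter

# Matches the last line of a traceback, e.g.
# "playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded."
EXCEPTION_LINE = re.compile(
    r"^(?P<name>[A-Za-z_][\w.]*(?:Error|Exception)): (?P<msg>.*)$"
)


def tally_exceptions(lines):
    """Count (exception class, message) pairs found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_LINE.match(line.strip())
        if match:
            # Keep only the class name, dropping the module path.
            short_name = match.group("name").rsplit(".", 1)[-1]
            counts[(short_name, match.group("msg"))] += 1
    return counts
```

For a real run, pass an open log file, for example `tally_exceptions(open("scrapy.log"))`, where `scrapy.log` is a placeholder file name, and inspect `counts.most_common()`.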
My Scrapy settings look like:
CONCURRENT_REQUESTS = 30
...
# Playwright settings
PLAYWRIGHT_CONTEXT_ARGS = {'ignore_https_errors':True}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30000
...
RETRY_ENABLED = True
RETRY_TIMES = 3
My env:
Ubuntu 20.04 and MacOS 11.2.3
Python 3.8.5
Scrapy 2.5.0
playwright 1.12.1
scrapy-playwright 0.0.3
Issue Analytics
- State:
- Created 2 years ago
- Comments:10 (2 by maintainers)
Top GitHub Comments
It’s been a while, but I think I understand what’s happening now: #74.
Hi @Obeyed, thanks for sharing your test. I also tested on the multiple-contexts branch, creating a new context per domain for about 2500 URLs, and the error "Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory" is absolutely a bug.