API to retrieve items from execution
In order to execute Scrapy from a script and retrieve the individual items it scrapes, I've used the following snippet.
Is there a better way to do that? Also, I wondered whether this could be incorporated into the library (probably by changing scrapy.crawler:CrawlerProcess).
import logging
import multiprocessing
import os
import sys

import six
from twisted.internet import reactor

from scrapy import signals, optional_features
from scrapy.crawler import Crawler


def scrape_items(timeout, settings, spidercls, spiderkwargs):
    """Runs Scrapy on an isolated process.

    Usage:

        import logging
        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
        from myproject import MySpider
        timeout = 10
        settings = {}
        spiderkwargs = {'domain': 'scrapy.org'}
        for item in scraper.scrape_items(timeout, settings, MySpider, spiderkwargs):
            print(repr(item))
    """
    timeout = int(timeout)
    if timeout <= 0:
        raise ValueError('timeout must be greater than zero')
    # NOTE: timeout is validated but not otherwise enforced below; the queue
    # reads poll on a fixed 5-second interval instead.
    items_queue = multiprocessing.Queue()
    p = multiprocessing.Process(
        target=_scraper_callback,
        args=(items_queue, timeout, settings, spidercls, spiderkwargs),
    )
    p.start()
    while True:
        try:
            queue_item = items_queue.get(timeout=5)
            # Ignore anything that is not a ('TAG', payload) tuple.
            if (not isinstance(queue_item, tuple) or len(queue_item) < 1
                    or not isinstance(queue_item[0], six.string_types)):
                continue
            elif queue_item[0] == 'ITEM':
                yield queue_item[1]
                continue
            elif queue_item[0] == 'FINISHED-SUCCESS':
                logging.info('Finished crawling.')
                logging.info(repr(queue_item[1]))  # stats
                break
            elif queue_item[0] == 'FINISHED-ERROR':
                raise queue_item[1]  # to be handled below
        except multiprocessing.queues.Empty:
            continue  # try again
        except Exception as e:
            exc_type, exc_obj, exc_tb = sys.exc_info()
            fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
            exception_str = "%s:%s %s" % (fname, exc_tb.tb_lineno, repr(e))
            logging.info('Error crawling')
            logging.info(exception_str)
            break
    p.join()  # ensure the process finished its work and the queue is empty
    return


def _scraper_callback(items_queue, timeout, settings, spidercls, spiderkwargs):
    """Auxiliary. Important: this function runs in the context of a sub-process."""
    optional_features.remove('boto')  # see https://github.com/scrapy/scrapy/issues/1099
    try:
        # Instantiate the crawler and connect signals.
        crawler = Crawler(spidercls, settings)

        def handle_item(item):
            items_queue.put(('ITEM', item))

        crawler.signals.connect(handle_item, signal=signals.item_scraped)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        # Run Scrapy: schedule the crawl and spin the reactor until done.
        crawler.crawl(**spiderkwargs)
        reactor.run()
        try:
            stats = crawler.stats.get_stats()
        except Exception:
            stats = {}
        items_queue.put(('FINISHED-SUCCESS', stats))
    except Exception as e:
        # Any exception in this process makes the calling function return.
        items_queue.put(('FINISHED-ERROR', e))
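For comparison, the stock way to run a spider from a script is the CrawlerProcess mentioned above. A minimal sketch (reusing the MySpider example from the docstring) is below; note that it only runs the crawl in-process and never hands items back to the caller, which is exactly the gap the snippet above works around:

from scrapy.crawler import CrawlerProcess
from myproject import MySpider  # the same example spider as in the docstring

process = CrawlerProcess(settings={})
process.crawl(MySpider, domain='scrapy.org')
process.start()  # blocks until the crawl finishes; items only reach pipelines/exporters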
Top GitHub Comments
Along these same lines of getting hold of items in memory, there is also the idea of scraping items from somewhere other than the parse callback chain or pipelines, such as from extensions.
In the past I have had to resort to this little “we’re all consenting adults” use of a private method:
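(The snippet that followed is not reproduced here.) For the extensions idea itself, a hedged sketch using only public API, i.e. the item_scraped signal rather than a private method, could look like the following; ItemCollectorExtension is an illustrative name, not something in Scrapy:

from scrapy import signals

class ItemCollectorExtension:
    """Illustrative extension that keeps scraped items in memory."""

    def __init__(self):
        self.items = []

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # item_scraped is a public signal fired once per scraped item.
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def item_scraped(self, item, spider):
        self.items.append(item)

It would be enabled through the EXTENSIONS setting, e.g. EXTENSIONS = {'myproject.extensions.ItemCollectorExtension': 500}.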
ItemCursor is quite general: it has both crawl_d and crawler as attributes, so if we add a CrawlerRunner method which returns an ItemCursor, it can be used as a more powerful replacement for crawler_runner.crawl(spider). A better name than ItemCursor would be good: Crawl? Scrape? CrawlInfo? None of these are perfect; it should be possible to come up with a better one.
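The ItemCursor code under discussion is not shown in this thread. As a rough sketch under the assumptions above (an object carrying crawl_d and crawler, returned by a new CrawlerRunner method), it might look like this; the scrape method name and all internals are guesses, not an agreed design:

from scrapy import signals
from scrapy.crawler import CrawlerRunner

class ItemCursor:
    """Sketch only: wraps a running crawl and exposes its crawler, the crawl
    Deferred, and the items scraped so far."""

    def __init__(self, crawl_d, crawler):
        self.crawl_d = crawl_d    # Deferred that fires when the crawl ends
        self.crawler = crawler
        self.items = []
        self.finished = False
        # Connected before control returns to the reactor, so no items are missed.
        crawler.signals.connect(self._on_item, signal=signals.item_scraped)
        crawl_d.addBoth(self._on_finished)

    def _on_item(self, item, spider):
        self.items.append(item)

    def _on_finished(self, result):
        self.finished = True
        return result

class ItemCrawlerRunner(CrawlerRunner):
    # A real version would presumably be a method on CrawlerRunner itself.
    def scrape(self, crawler_or_spidercls, *args, **kwargs):
        crawler = self.create_crawler(crawler_or_spidercls)
        crawl_d = self.crawl(crawler, *args, **kwargs)
        return ItemCursor(crawl_d, crawler)

A real API would probably also expose asynchronous iteration over the items rather than a plain list, but even this shape shows why it subsumes the bare Deferred returned by crawler_runner.crawl(spider).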