
API to retrieve items from execution

See original GitHub issue

In order to run Scrapy from a script and retrieve individual items, I’ve used the following snippet.

Is there a better way to do this? I also wondered whether something like it could be incorporated into the library (probably by changing scrapy.crawler:CrawlerProcess).

import logging
import multiprocessing
import os
import sys

import six
from twisted.internet import reactor

from scrapy import signals, optional_features
from scrapy.crawler import Crawler


def scrape_items(timeout, settings, spidercls, spiderkwargs):
    """Runs Scrapy on an isolated process.

    Usage:
    import logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    from myproject import MySpider

    timeout = 10
    settings = {}
    spiderkwargs = {'domain': 'scrapy.org'}
    for item in scrape_items(timeout, settings, MySpider, spiderkwargs):
        print(repr(item))
    """
    timeout = int(timeout)
    if timeout <= 0:
        raise ValueError('timeout must be greater than zero')

    items_queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=_scraper_callback, args=(items_queue, timeout, settings, spidercls, spiderkwargs,))
    p.start()

    while True:
        try:
            queue_item = items_queue.get(timeout=5)
            if not isinstance(queue_item, tuple) or len(queue_item) < 1 or not isinstance(queue_item[0], six.string_types):
                continue
            elif queue_item[0] == 'ITEM':
                yield queue_item[1]
                continue
            elif queue_item[0] == 'FINISHED-SUCCESS':
                logging.info('Finished crawling.')
                logging.info(repr(queue_item[1])) # stats
                break
            elif queue_item[0] == 'FINISHED-ERROR':
                raise queue_item[1] # to be handled below
        except multiprocessing.queues.Empty:
            continue # try again
        except Exception as e:
            exc_type, exc_obj, exc_tb = sys.exc_info()
            fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
            exception_str = "%s:%s %s" % (fname, exc_tb.tb_lineno, repr(e))
            logging.info('Error crawling')
            logging.info(exception_str)
            break
    p.join() # ensure process finished its work and queue is empty
    return


def _scraper_callback(items_queue, timeout, settings, spidercls, spiderkwargs):
    """Auxiliary. Important: this function run in the context of a sub-process."""

    optional_features.remove('boto') # see https://github.com/scrapy/scrapy/issues/1099

    # instantiate crawler and spider
    try:
        crawler = Crawler(spidercls, settings)

        # connect signals
        def handle_item(item):
            items_queue.put(('ITEM', item,))
        crawler.signals.connect(handle_item, signal=signals.item_scraped)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)

        # run scrapy
        crawler.crawl(**spiderkwargs)
        reactor.run()
        try:
            stats = crawler.stats.get_stats()
        except:
            stats = {}
        items_queue.put(('FINISHED-SUCCESS', stats))
    except Exception as e:
        # any exception in the sub-process will make the calling function return
        items_queue.put(('FINISHED-ERROR', e))
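
For comparison, if isolation in a separate process is not required, a simpler in-process variant is possible: connect a handler to the item_scraped signal before starting a CrawlerProcess and collect items into a list. This is only a sketch, assuming Scrapy 1.0+ (where CrawlerProcess.create_crawler is available); it is not the snippet above, and it can only run once per process because the Twisted reactor cannot be restarted.

from scrapy import signals
from scrapy.crawler import CrawlerProcess


def collect_items(settings, spidercls, **spiderkwargs):
    """Run a single spider and return the scraped items as a list.

    Sketch only; assumes Scrapy 1.0+ and a single crawl per process.
    """
    items = []

    def handle_item(item, response, spider):
        items.append(item)

    process = CrawlerProcess(settings)
    crawler = process.create_crawler(spidercls)
    crawler.signals.connect(handle_item, signal=signals.item_scraped)
    process.crawl(crawler, **spiderkwargs)
    process.start()  # blocks until the crawl finishes
    return items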

Issue Analytics

  • State: open
  • Created 8 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
stav commented, Mar 29, 2016

Along the same lines of getting hold of items in memory, there is also the idea of scraping items from somewhere other than the parse callback chain or the pipelines, such as from extensions.

In the past I have had to resort to this little “we’re all consenting adults” use of a private method:

def add_item(self, sitemap, spider):
    request = response = None
    scraper = self.crawler.engine.scraper
    item = myproject.items.SitemapItem(dict(sitemap=sitemap))
    scraper._process_spidermw_output(item, request, response, spider)
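
For context, a method like this would normally live on an extension that keeps a reference to the crawler. The sketch below shows one way the wiring could look; the class name, the spider_opened trigger and the plain-dict item are illustrative assumptions rather than part of the original comment, and _process_spidermw_output remains a private Scrapy API.

from scrapy import signals


class SitemapInjector(object):
    """Illustrative extension that injects an item outside the parse chain."""

    def __init__(self, crawler):
        self.crawler = crawler
        # spider_opened is just an example trigger; any handler or callback
        # with access to the crawler could call add_item instead
        crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_opened(self, spider):
        self.add_item({'loc': 'https://scrapy.org/sitemap.xml'}, spider)

    def add_item(self, sitemap, spider):
        request = response = None
        scraper = self.crawler.engine.scraper
        # same private-method trick as above, using a plain dict as the item
        scraper._process_spidermw_output(dict(sitemap=sitemap), request, response, spider)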

0 reactions
kmike commented, Oct 6, 2015

ItemCursor is quite general: it has both crawl_d and crawler as attributes, so if we add a CrawlerRunner method which returns an ItemCursor, it can be used instead as a more powerful version of crawler_runner.crawl(spider). A better name than ItemCursor would be nice: Crawl? Scrape? CrawlInfo? None of them is perfect; it should be possible to come up with a better name.
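
To make the proposal a bit more concrete, here is a purely hypothetical sketch of what such an object and its usage might look like; none of these names exist in Scrapy, and only crawl_d and crawler come from the comment above.

from scrapy import signals


class ItemCursor(object):
    """Hypothetical object returned by a CrawlerRunner method."""

    def __init__(self, crawl_d, crawler):
        self.crawl_d = crawl_d    # Deferred fired when the crawl finishes
        self.crawler = crawler    # access to stats, signals, settings
        self._items = []
        crawler.signals.connect(self._on_item, signal=signals.item_scraped)

    def _on_item(self, item, response, spider):
        self._items.append(item)

    def __iter__(self):
        return iter(self._items)

# Rough usage inside a deferred callback chain:
#   cursor = runner.scrape(MySpider, domain='scrapy.org')  # hypothetical method
#   yield cursor.crawl_d
#   for item in cursor:
#       print(item)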

Read more comments on GitHub >
