
API to retrieve items from execution

See original GitHub issue

In order to run Scrapy from a script and retrieve individual items, I’ve used the following snippet.

Is there a better way to do this? I also wondered whether something like it could be incorporated into the library (probably by changing scrapy.crawler:CrawlerProcess).

import logging
import multiprocessing
import os
import sys

import six
from twisted.internet import reactor

from scrapy import signals, optional_features
from scrapy.crawler import Crawler


def scrape_items(timeout, settings, spidercls, spiderkwargs):
    """Runs Scrapy on an isolated process.

    Usage:
    import logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    from myproject import MySpider

    timeout = 10
    settings = {}
    spiderkwargs = {'domain': 'scrapy.org'}
    for item in scrape_items(timeout, settings, MySpider, spiderkwargs):
        print(repr(item))
    """
    timeout = int(timeout)
    if timeout <= 0:
        raise ValueError('timeout must be greater than zero')

    items_queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=_scraper_callback, args=(items_queue, timeout, settings, spidercls, spiderkwargs,))
    p.start()

    while True:
        try:
            queue_item = items_queue.get(timeout=5)
            if not isinstance(queue_item, tuple) or len(queue_item) < 1 or not isinstance(queue_item[0], six.string_types):
                continue
            elif queue_item[0] == 'ITEM':
                yield queue_item[1]
                continue
            elif queue_item[0] == 'FINISHED-SUCCESS':
                logging.info('Finished crawling.')
                logging.info(repr(queue_item[1])) # stats
                break
            elif queue_item[0] == 'FINISHED-ERROR':
                raise queue_item[1] # to be handled below
        except multiprocessing.queues.Empty:
            continue # try again
        except Exception as e:
            exc_type, exc_obj, exc_tb = sys.exc_info()
            fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
            exception_str = "%s:%s %s" % (fname, exc_tb.tb_lineno, repr(e))
            logging.info('Error crawling')
            logging.info(exception_str)
            break
    p.join() # ensure process finished its work and queue is empty
    return


def _scraper_callback(items_queue, timeout, settings, spidercls, spiderkwargs):
    """Auxiliary. Important: this function run in the context of a sub-process."""

    optional_features.remove('boto') # see https://github.com/scrapy/scrapy/issues/1099

    # instantiate crawler and spider
    try:
        crawler = Crawler(spidercls, settings)

        # connect signals
        def handle_item(item):
            items_queue.put(('ITEM', item,))
        crawler.signals.connect(handle_item, signal=signals.item_scraped)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)

        # run scrapy
        crawler.crawl(**spiderkwargs)
        reactor.run()
        try:
            stats = crawler.stats.get_stats()
        except:
            stats = {}
        items_queue.put(('FINISHED-SUCCESS', stats))
    except Exception as e:
        # any exception in the sub-process will make the calling function return
        items_queue.put(('FINISHED-ERROR', e))
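
For comparison, if isolation in a separate process is not required, a simpler in-process variant is possible: connect a handler to the item_scraped signal before starting a CrawlerProcess and collect items into a list. This is only a sketch, assuming Scrapy 1.0+ (where CrawlerProcess.create_crawler is available); it is not the snippet above, and it can only run once per process because the Twisted reactor cannot be restarted.

from scrapy import signals
from scrapy.crawler import CrawlerProcess


def collect_items(settings, spidercls, **spiderkwargs):
    """Run a single spider and return the scraped items as a list.

    Sketch only; assumes Scrapy 1.0+ and a single crawl per process.
    """
    items = []

    def handle_item(item, response, spider):
        items.append(item)

    process = CrawlerProcess(settings)
    crawler = process.create_crawler(spidercls)
    crawler.signals.connect(handle_item, signal=signals.item_scraped)
    process.crawl(crawler, **spiderkwargs)
    process.start()  # blocks until the crawl finishes
    return items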

Issue Analytics

  • State: open
  • Created 8 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
stav commented, Mar 29, 2016

Along the same lines of getting hold of items in memory, there is also the idea of scraping items from somewhere other than the parse callback chain or the pipelines, such as from extensions.

In the past I have had to resort to this little “we’re all consenting adults” use of a private method:

def add_item(self, sitemap, spider):
    request = response = None
    scraper = self.crawler.engine.scraper
    item = myproject.items.SitemapItem(dict(sitemap=sitemap))
    scraper._process_spidermw_output(item, request, response, spider)
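
For context, a method like this would normally live on an extension that keeps a reference to the crawler. The sketch below shows one way the wiring could look; the class name, the spider_opened trigger and the plain-dict item are illustrative assumptions rather than part of the original comment, and _process_spidermw_output remains a private Scrapy API.

from scrapy import signals


class SitemapInjector(object):
    """Illustrative extension that injects an item outside the parse chain."""

    def __init__(self, crawler):
        self.crawler = crawler
        # spider_opened is just an example trigger; any handler or callback
        # with access to the crawler could call add_item instead
        crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_opened(self, spider):
        self.add_item({'loc': 'https://scrapy.org/sitemap.xml'}, spider)

    def add_item(self, sitemap, spider):
        request = response = None
        scraper = self.crawler.engine.scraper
        # same private-method trick as above, using a plain dict as the item
        scraper._process_spidermw_output(dict(sitemap=sitemap), request, response, spider)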

0 reactions
kmike commented, Oct 6, 2015

ItemCursor is quite general: it has both crawl_d and crawler as attributes, so if we add a CrawlerRunner method which returns an ItemCursor, it can be used instead as a more powerful version of crawler_runner.crawl(spider). A better name than ItemCursor would be nice: Crawl? Scrape? CrawlInfo? None of them is perfect; it should be possible to come up with a better name.
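
To make the proposal a bit more concrete, here is a purely hypothetical sketch of what such an object and its usage might look like; none of these names exist in Scrapy, and only crawl_d and crawler come from the comment above.

from scrapy import signals


class ItemCursor(object):
    """Hypothetical object returned by a CrawlerRunner method."""

    def __init__(self, crawl_d, crawler):
        self.crawl_d = crawl_d    # Deferred fired when the crawl finishes
        self.crawler = crawler    # access to stats, signals, settings
        self._items = []
        crawler.signals.connect(self._on_item, signal=signals.item_scraped)

    def _on_item(self, item, response, spider):
        self._items.append(item)

    def __iter__(self):
        return iter(self._items)

# Rough usage inside a deferred callback chain:
#   cursor = runner.scrape(MySpider, domain='scrapy.org')  # hypothetical method
#   yield cursor.crawl_d
#   for item in cursor:
#       print(item)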

Read more comments on GitHub >
