
Improve Testability of Scrapy (ReactorNotRestartable)


Current Situation

Using Scrapy as described in the tutorials (intro/tutorial.html and topics/practices.html) can throw a twisted.internet.error.ReactorNotRestartable error when run under unittest.

Both CrawlerProcess and CrawlerRunner will raise the error.
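
For illustration only (this sketch is not from the original report), a unittest-based reproduction might look like the following; it assumes the QuotesSpider defined in the example code further down, and CrawlerRunner fails the same way:

# Hypothetical unittest reproduction (illustration, not from the original report).
# QuotesSpider is assumed to be the spider defined in the example code below.
import unittest
from scrapy.crawler import CrawlerProcess

class QuotesSpiderTest(unittest.TestCase):
    def _crawl(self):
        process = CrawlerProcess({"LOG_LEVEL": "ERROR"})
        process.crawl(QuotesSpider)
        process.start()  # starts the Twisted reactor and blocks until the crawl ends

    def test_first_crawl(self):
        self._crawl()    # passes

    def test_second_crawl(self):
        self._crawl()    # raises twisted.internet.error.ReactorNotRestartable

if __name__ == "__main__":
    unittest.main()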

There also seems to be no easy solution to this problem.

Suggestion

Make Scrapy work in unittest environments without throwing the twisted.internet.error.ReactorNotRestartable error.

Remarks

Might be related to https://github.com/scrapy/scrapy/issues/2594, but the intention is different.

Example code

The code below will reproduce the error. It does not contain a unit test, but essentially has the same behavior.

# ====================================================================

# Define the quotes spider

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

# ====================================================================

# Function to run spider
def run_spider():
    configure_logging()
    runner = CrawlerRunner({"LOG_LEVEL": "ERROR"})
    runner.crawl(QuotesSpider)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.run()  # the script will block here until all crawling jobs are finished

# ====================================================================

if "__main__" == __name__:
   run_spider()
   run_spider()   # This call will fail
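
One workaround often suggested for this situation (not part of the original issue) is to fork a separate process per crawl, so that each run gets a fresh reactor. A minimal sketch, reusing the QuotesSpider from the example above:

# Workaround sketch (not from the issue): run each crawl in a child process
# so that every run gets its own, fresh Twisted reactor.
from multiprocessing import Process
from scrapy.crawler import CrawlerProcess

def _crawl_in_child():
    process = CrawlerProcess({"LOG_LEVEL": "ERROR"})
    process.crawl(QuotesSpider)   # QuotesSpider from the example above
    process.start()               # reactor starts and stops inside the child process

def run_spider_isolated():
    p = Process(target=_crawl_in_child)
    p.start()
    p.join()

if __name__ == "__main__":
    run_spider_isolated()
    run_spider_isolated()   # works: a new process means a new reactor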

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

4 reactions
DarkTrick commented, Oct 10, 2020

Well, I guess this bug is about API/architecture perception and language perception, so it is probably a very subjective topic (that we have different opinions about).

I’ll sum up my points and close the bug. I guess everything else would just go in circles:

  • I would say the docs do not describe what you are describing here.

  • I would say the “assumes basic knowledge of the Twisted reactor” argument is questionable because

    1. The referenced Stack Overflow questions have no generally working solution (i.e. “it’s not only me, but apparently quite a few people who have no idea how to solve it”)
    2. “Assuming basic knowledge of the Twisted reactor” is not stated within the docs (did I miss it?), so it shouldn’t be a requirement
  • I would say “because the others also do it” should not count as a valid reason.

You can look at https://github.com/scrapy/scrapy/blob/master/tests/test_crawl.py for a correct implementation

Thank you very much for the link, I will try to make use of it.
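
For readers following along: the linked tests are written against Twisted's trial test framework rather than plain unittest, so the reactor is managed by the test runner itself. A rough sketch of that style (an illustration, not Scrapy's actual test code), assuming the QuotesSpider from the example code in the first post and run with Twisted's trial tool rather than python -m unittest:

# Rough sketch of a trial-style test. trial owns the reactor, so the test
# never starts or stops it; it just yields the crawl's Deferred.
from twisted.internet import defer
from twisted.trial.unittest import TestCase
from scrapy.crawler import CrawlerRunner

class QuotesCrawlTest(TestCase):
    @defer.inlineCallbacks
    def test_quotes_spider(self):
        runner = CrawlerRunner({"LOG_LEVEL": "ERROR"})
        yield runner.crawl(QuotesSpider)  # QuotesSpider from the first post
        # assertions on the crawl's side effects would go here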

1 reaction
wRAR commented, Oct 3, 2020

I was expecting that it’s a better-known problem that Scrapy does not perform well in unit tests.

I still don’t know what you mean by that, and you didn’t provide any unittest-related code. Note that Scrapy itself has an extensive test suite which exercises it in a variety of modes.

This is strange. From the page above I’m using this code:

And this code works as expected. On the other hand, the code you added to the first post is incorrect and goes against the Scrapy documentation: you have effectively reimplemented CrawlerProcess as your run_spider() function, so it cannot run twice either. You are supposed to start (and stop) the reactor only once. This is not specific to Scrapy, by the way.
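
To make the “start the reactor only once” point concrete, here is a minimal sketch (an illustration, not wRAR's code) that schedules both crawls on a single CrawlerRunner and starts the reactor exactly once, reusing the imports and QuotesSpider from the first post:

# Minimal sketch: schedule every crawl first, then start the reactor once.
# Reuses the imports and QuotesSpider from the example in the first post.
def run_spiders_once():
    configure_logging()
    runner = CrawlerRunner({"LOG_LEVEL": "ERROR"})
    runner.crawl(QuotesSpider)
    runner.crawl(QuotesSpider)      # schedule as many crawls as needed
    d = runner.join()               # fires when all scheduled crawls finish
    d.addBoth(lambda _: reactor.stop())
    reactor.run()                   # started once, stopped once

if __name__ == "__main__":
    run_spiders_once()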
