SpiderLoader - holds all of the project's spider modules in memory during a scraping run.

Description

When running scrapy crawl spidername, the original purpose of SpiderLoader is to search for and load the required spider by its name. However, the current implementation of SpiderLoader imports all of the project's spider modules and holds them in memory until the end of the process, even in cases where only a single spider needs to be loaded for the crawl command.

This is probably already mentioned in https://github.com/scrapy/scrapy/issues/1805
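
For reference, this is roughly what the current loader does (a simplified paraphrase of scrapy.spiderloader.SpiderLoader, not the verbatim source): every module listed under SPIDER_MODULES is imported eagerly in the constructor and the discovered spider classes are cached in self._spiders, so the imported modules stay reachable for the whole lifetime of the process.

from scrapy.utils.misc import walk_modules
from scrapy.utils.spider import iter_spider_classes


class SpiderLoader:
    def __init__(self, settings):
        self.spider_modules = settings.getlist("SPIDER_MODULES")
        self._spiders = {}
        # Every module under SPIDER_MODULES is imported here, at construction
        # time, even when only one spider will ever be used in this process.
        for prefix in self.spider_modules:
            for module in walk_modules(prefix):
                for spcls in iter_spider_classes(module):
                    # The class reference keeps its module alive as well.
                    self._spiders[spcls.name] = spcls

    def load(self, spider_name):
        # Lookups are served from the in-memory cache built above.
        try:
            return self._spiders[spider_name]
        except KeyError:
            raise KeyError(f"Spider not found: {spider_name}")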

Steps to Reproduce

  1. Create a Scrapy project with at least 1 spider (valid enough to make one request and display the memusage/startup stats).
  2. Execute the following script to generate the required number of spiders:
import sys
import optparse

from scrapy.cmdline import _get_commands_dict, _pop_command_name,\
    _run_print_help, _run_command
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import inside_project, get_project_settings


def execute(argv=None, settings=get_project_settings()):
    if argv is None:
        argv = sys.argv

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(),
                                   conflict_handler='resolve')
    cmd = cmds[cmdname]
    parser.usage = f"scrapy {cmdname} {cmd.syntax()}"
    parser.description = cmd.long_desc()
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    # sys.exit(cmd.exitcode)  # commented out: sys.exit would prevent calling multiple Scrapy commands from one script

for i in range(3200):
    name = 'spider' + str(i)
    domain = f'www{str(i)}.quotes.toscrape.com' * 512  # the generated spider file should be ~26 KB
    execute(['scrapy', 'genspider', name, domain, '-t', 'crawl'])

  3. Comparing memusage/startup for different numbers of spiders clearly indicates that Scrapy allocates roughly 1.5x to 2x the total size of the spider code files (a sketch of how such a figure can be collected follows the table).
spiders in project | memusage/startup (bytes) | spider files total size (bytes) | memusage/startup - memusage/startup(1 spider)
1                  | 51 748 864               | -                               | -
101                | 55 906 304               | 2 689 024                       | 4 157 440
201                | 60 928 000               | 5 351 424                       | 9 179 136
401                | 69 873 664               | 10 676 224                      | 18 124 800
801                | 88 117 248               | 21 325 824                      | 36 368 384
1601               | 125 046 784              | 42 625 024                      | 73 297 920
3201               | 199 233 536              | 85 223 424                      | 147 484 672
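
For context, here is a minimal way to collect a memusage/startup figure like the ones above. This helper is not part of the original report; it assumes the project contains a spider named "spider0" and runs on a POSIX platform, since the MemoryUsage extension relies on the resource module.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set("MEMUSAGE_ENABLED", True)  # make sure the MemoryUsage extension is active

process = CrawlerProcess(settings)
crawler = process.create_crawler("spider0")  # hypothetical spider name
process.crawl(crawler)
process.start()

# memusage/startup is recorded by the MemoryUsage extension when the engine starts
print("memusage/startup:", crawler.stats.get_value("memusage/startup"))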

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
Gallaecio commented, Feb 11, 2021

I’m unsure.

The question I linked is from 10 years ago, quoting a developer saying it will not be possible for at least 5 years. A response from last year mentions that in Python 3.8 he can see savings. So I probably was too quick to judge.

However, I’m not sure if it’s enough to just not keep a reference to the modules, or if they also need to be removed from sys.modules. And if it’s the latter, we may need to check the contents of sys.modules before we start loading spider modules, to make sure we do not later unload a module that had been imported before.

So, it may be possible, but not as trivial as I had hoped for.
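
A rough sketch of what that bookkeeping could look like (hypothetical, not an actual patch; walk_and_release is a made-up helper name):

import sys

from scrapy.utils.misc import walk_modules


def walk_and_release(module_path):
    # Snapshot sys.modules before importing the project's spider modules,
    # so modules that were already imported are never unloaded later.
    already_imported = set(sys.modules)

    modules = walk_modules(module_path)  # imports every submodule eagerly
    # ... inspect `modules` here, e.g. record spider names and locations ...

    newly_imported = set(sys.modules) - already_imported
    for name in newly_imported:
        del sys.modules[name]  # drop only what this call imported

    # Caveat from the discussion above: any object that still references these
    # modules (e.g. a cached spider class) keeps them alive despite the deletion.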

1 reaction
Gallaecio commented, Feb 10, 2021

To be able to help you properly, it would be useful to know which points you found hard to understand.

This issue is about the SpiderLoader class keeping all spider modules in memory, specifically in self._spiders. Solving the issue should be relatively easy: instead of storing all spider modules so that they can later be returned from memory, do not store them and instead load them again whenever they are requested.

However, I've just found out that what I thought would be trivial may be impossible. It seems Python modules cannot be unloaded, which would mean that this issue cannot be solved 🙁
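
For illustration, the "do not store them" idea could look roughly like this. This is a hypothetical sketch, not Scrapy's actual SpiderLoader; it skips duplicate-name checks and error handling, and it still needs the sys.modules cleanup discussed in the other comment for memory to actually be reclaimed after the initial scan.

import importlib

from scrapy.utils.misc import walk_modules
from scrapy.utils.spider import iter_spider_classes


class LazySpiderLoader:
    def __init__(self, settings):
        self.spider_modules = settings.getlist("SPIDER_MODULES")
        # Map spider name -> dotted module path; no class or module objects are
        # cached, so nothing here pins the spider code in memory between calls.
        self._locations = {}
        for prefix in self.spider_modules:
            for module in walk_modules(prefix):
                for spcls in iter_spider_classes(module):
                    self._locations[spcls.name] = module.__name__

    def list(self):
        return list(self._locations)

    def load(self, spider_name):
        # Import (or fetch from sys.modules) only the module that defines the
        # requested spider, and return its class.
        module = importlib.import_module(self._locations[spider_name])
        for spcls in iter_spider_classes(module):
            if spcls.name == spider_name:
                return spcls
        raise KeyError(f"Spider not found: {spider_name}")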
