SpiderLoader - holds all of the project's spider modules in memory during a scraping run.

Description

When running scrapy crawl spidername, the original purpose of SpiderLoader is to search for and load the required spider by its name. However, the current implementation of SpiderLoader imports all of the project's spider modules and holds them in memory until the end of the process, even in cases where only a single spider needs to be loaded for the crawl command.

This is probably already mentioned in https://github.com/scrapy/scrapy/issues/1805
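
For reference, this is roughly what the current loader does (a simplified paraphrase of scrapy.spiderloader.SpiderLoader, not the verbatim source): every module listed under SPIDER_MODULES is imported eagerly in the constructor and the discovered spider classes are cached in self._spiders, so the imported modules stay reachable for the whole lifetime of the process.

from scrapy.utils.misc import walk_modules
from scrapy.utils.spider import iter_spider_classes


class SpiderLoader:
    def __init__(self, settings):
        self.spider_modules = settings.getlist("SPIDER_MODULES")
        self._spiders = {}
        # Every module under SPIDER_MODULES is imported here, at construction
        # time, even when only one spider will ever be used in this process.
        for prefix in self.spider_modules:
            for module in walk_modules(prefix):
                for spcls in iter_spider_classes(module):
                    # The class reference keeps its module alive as well.
                    self._spiders[spcls.name] = spcls

    def load(self, spider_name):
        # Lookups are served from the in-memory cache built above.
        try:
            return self._spiders[spider_name]
        except KeyError:
            raise KeyError(f"Spider not found: {spider_name}")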

Steps to Reproduce

  1. Create a Scrapy project with at least 1 spider (valid enough to make one request and display the memusage/startup stats).
  2. Execute the following script to generate the required number of spiders:
import sys
import optparse

from scrapy.cmdline import _get_commands_dict, _pop_command_name,\
    _run_print_help, _run_command
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import inside_project, get_project_settings


def execute(argv=None, settings=get_project_settings()):
    if argv is None:
        argv = sys.argv

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(),
                                   conflict_handler='resolve')
    cmd = cmds[cmdname]
    parser.usage = f"scrapy {cmdname} {cmd.syntax()}"
    parser.description = cmd.long_desc()
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    # sys.exit(cmd.exitcode)  # commented out: sys.exit would prevent calling multiple Scrapy commands from one script

for i in range(3200):
    name = 'spider' + str(i)
    domain = f'www{str(i)}.quotes.toscrape.com' * 512  # the generated spider file should be ~26 KB
    execute(['scrapy', 'genspider', name, domain, '-t', 'crawl'])

  3. Comparing memusage/startup for different numbers of spiders clearly indicates that Scrapy allocates roughly 1.5x to 2x the total size of the spider code files (a sketch of how such a figure can be collected follows the table).
spiders in project | memusage/startup (bytes) | spider files total size (bytes) | memusage/startup - memusage/startup(1 spider)
1                  | 51 748 864               | -                               | -
101                | 55 906 304               | 2 689 024                       | 4 157 440
201                | 60 928 000               | 5 351 424                       | 9 179 136
401                | 69 873 664               | 10 676 224                      | 18 124 800
801                | 88 117 248               | 21 325 824                      | 36 368 384
1601               | 125 046 784              | 42 625 024                      | 73 297 920
3201               | 199 233 536              | 85 223 424                      | 147 484 672
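
For context, here is a minimal way to collect a memusage/startup figure like the ones above. This helper is not part of the original report; it assumes the project contains a spider named "spider0" and runs on a POSIX platform, since the MemoryUsage extension relies on the resource module.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set("MEMUSAGE_ENABLED", True)  # make sure the MemoryUsage extension is active

process = CrawlerProcess(settings)
crawler = process.create_crawler("spider0")  # hypothetical spider name
process.crawl(crawler)
process.start()

# memusage/startup is recorded by the MemoryUsage extension when the engine starts
print("memusage/startup:", crawler.stats.get_value("memusage/startup"))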

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
Gallaecio commented, Feb 11, 2021

I’m unsure.

The question I linked is from 10 years ago, quoting a developer saying it will not be possible for at least 5 years. A response from last year mentions that in Python 3.8 he can see savings. So I probably was too quick to judge.

However, I’m not sure if it’s enough to just not keep a reference to the modules, or if they also need to be removed from sys.modules. And if it’s the latter, we may need to check the contents of sys.modules before we start loading spider modules, to make sure we do not later unload a module that had been imported before.

So, it may be possible, but not as trivial as I had hoped for.
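
A rough sketch of what that bookkeeping could look like (hypothetical, not an actual patch; walk_and_release is a made-up helper name):

import sys

from scrapy.utils.misc import walk_modules


def walk_and_release(module_path):
    # Snapshot sys.modules before importing the project's spider modules,
    # so modules that were already imported are never unloaded later.
    already_imported = set(sys.modules)

    modules = walk_modules(module_path)  # imports every submodule eagerly
    # ... inspect `modules` here, e.g. record spider names and locations ...

    newly_imported = set(sys.modules) - already_imported
    for name in newly_imported:
        del sys.modules[name]  # drop only what this call imported

    # Caveat from the discussion above: any object that still references these
    # modules (e.g. a cached spider class) keeps them alive despite the deletion.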

1 reaction
Gallaecio commented, Feb 10, 2021

To be able to help you properly, it would be useful to know which points you found hard to understand.

This issue is about the SpiderLoader class keeping all spider modules in memory, specifically in self._spiders. Solving the issue should be relatively easy: instead of storing all spider modules so that they can later be returned from memory, do not store them and instead load them again whenever they are requested.

However, I've just found out that what I thought would be trivial may be impossible. It seems Python modules cannot be unloaded, which would mean that this issue cannot be solved 🙁
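
For illustration, the "do not store them" idea could look roughly like this. This is a hypothetical sketch, not Scrapy's actual SpiderLoader; it skips duplicate-name checks and error handling, and it still needs the sys.modules cleanup discussed in the other comment for memory to actually be reclaimed after the initial scan.

import importlib

from scrapy.utils.misc import walk_modules
from scrapy.utils.spider import iter_spider_classes


class LazySpiderLoader:
    def __init__(self, settings):
        self.spider_modules = settings.getlist("SPIDER_MODULES")
        # Map spider name -> dotted module path; no class or module objects are
        # cached, so nothing here pins the spider code in memory between calls.
        self._locations = {}
        for prefix in self.spider_modules:
            for module in walk_modules(prefix):
                for spcls in iter_spider_classes(module):
                    self._locations[spcls.name] = module.__name__

    def list(self):
        return list(self._locations)

    def load(self, spider_name):
        # Import (or fetch from sys.modules) only the module that defines the
        # requested spider, and return its class.
        module = importlib.import_module(self._locations[spider_name])
        for spcls in iter_spider_classes(module):
            if spcls.name == spider_name:
                return spcls
        raise KeyError(f"Spider not found: {spider_name}")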
