SpiderLoader holds all of the project's spider modules in memory for the duration of the scraping run
See original GitHub issue

Description
On scrapy crawl spidername, the original purpose of the spider loader is to find and load the required spider by its name. However, the current implementation of SpiderLoader imports all of the project's spider modules and keeps them in memory until the end of the process, even when only a single spider needs to be loaded for the crawl command.
This is probably already mentioned in https://github.com/scrapy/scrapy/issues/1805.
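
To make the behaviour concrete, here is a simplified sketch of what such an eager loader does. This is illustrative only, not the actual SpiderLoader source; the package name myproject.spiders and the class-detection heuristic are assumptions standing in for Scrapy's real discovery helpers.

# Illustrative sketch only -- not the real scrapy.spiderloader.SpiderLoader.
# It eagerly imports every module under the spider packages and caches the
# discovered classes for the whole lifetime of the process.
from importlib import import_module
from pkgutil import walk_packages


class EagerSpiderLoader:
    def __init__(self, spider_modules=('myproject.spiders',)):  # assumed package name
        self._spiders = {}  # name -> spider class, kept until the process exits
        for root in spider_modules:
            package = import_module(root)
            for info in walk_packages(package.__path__, prefix=root + '.'):
                module = import_module(info.name)  # module also stays in sys.modules
                for obj in vars(module).values():
                    # crude stand-in for Scrapy's spider-class detection
                    if isinstance(obj, type) and getattr(obj, 'name', None):
                        self._spiders[obj.name] = obj

    def load(self, spider_name):
        # even a single `scrapy crawl spidername` pays for all the imports above
        return self._spiders[spider_name]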
Steps to Reproduce
- Create a Scrapy project with at least 1 spider (valid enough to make 1 request and report the memusage/startup stat).
- Execute the following script (mass spider generation code) to create the required number of spiders:
import sys
import optparse

from scrapy.cmdline import _get_commands_dict, _pop_command_name, \
    _run_print_help, _run_command
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import inside_project, get_project_settings


def execute(argv=None, settings=get_project_settings()):
    if argv is None:
        argv = sys.argv
    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(),
                                   conflict_handler='resolve')
    cmd = cmds[cmdname]
    parser.usage = f"scrapy {cmdname} {cmd.syntax()}"
    parser.description = cmd.long_desc()
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)
    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    # sys.exit(cmd.exitcode)  <- commented out so the script can call scrapy commands multiple times


for i in range(3200):
    name = 'spider' + str(i)
    domain = f'www{str(i)}.quotes.toscrape.com' * 512  # resulting spider file should be ~26 KB
    execute(['scrapy', 'genspider', name, domain, '-t', 'crawl'])
- A comparison of memusage/startup for different numbers of spiders clearly indicates that Scrapy allocates roughly 1.5x to 2x the total size of the spider code files (see the sketch after the table for reading this stat programmatically).
| Spiders in project | memusage/startup (bytes) | Spider files total size (bytes) | memusage/startup − memusage/startup (1 spider) |
|---|---|---|---|
| 1 | 51 748 864 | | |
| 101 | 55 906 304 | 2 689 024 | 4 157 440 |
| 201 | 60 928 000 | 5 351 424 | 9 179 136 |
| 401 | 69 873 664 | 10 676 224 | 18 124 800 |
| 801 | 88 117 248 | 21 325 824 | 36 368 384 |
| 1601 | 125 046 784 | 42 625 024 | 73 297 920 |
| 3201 | 199 233 536 | 85 223 424 | 147 484 672 |
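
For reference, the memusage/startup figure in the table can also be read programmatically. This is a hedged sketch: 'spider0' is just one of the names generated by the script above, and the stat is only collected where Scrapy's memory-usage extension can run (it relies on the stdlib resource module, i.e. Unix-like systems). Note that even this single-spider crawl triggers the import of every spider module in the project.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
crawler = process.create_crawler('spider0')  # 'spider0' was generated by the script above
process.crawl(crawler)
process.start()  # blocks until the crawl finishes
print(crawler.stats.get_value('memusage/startup'))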
I’m unsure.
The question I linked is from 10 years ago, quoting a developer saying it would not be possible for at least 5 years. A response from last year mentions that in Python 3.8 he can see savings. So I probably was too quick to judge.
However, I’m not sure whether it’s enough to simply not keep a reference to the modules, or whether they also need to be removed from sys.modules. And if it’s the latter, we may need to check the contents of sys.modules before we start loading spider modules, to make sure we do not later unload a module that had been imported before. So, it may be possible, but not as trivial as I had hoped for.
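
To illustrate why that bookkeeping is fiddly, here is a rough sketch of the idea (my own reading of the approach, not proposed Scrapy code): snapshot sys.modules before importing a spider module and afterwards drop only the entries that this import added.

import sys
from importlib import import_module


def load_spider_class_transiently(module_path, class_name):
    before = set(sys.modules)                # modules imported earlier are never touched
    module = import_module(module_path)
    cls = getattr(module, class_name)
    for name in set(sys.modules) - before:   # forget only what this call pulled in
        del sys.modules[name]
    # Memory is reclaimed only if nothing else still references the module;
    # the returned class keeps its module's globals alive through its methods,
    # which is exactly why this is not as trivial as it first looks.
    return cls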
To be able to help you properly, it would be useful to know which points you found hard to understand.
This issue is about the SpiderLoader class keeping all spider modules in memory, specifically in self._spiders. Solving the issue should be relatively easy: instead of storing all spider modules so they can later be returned from memory, do not store them and instead load them again when they are requested.
However, I’ve just found out that what I thought would be trivial may be impossible. It seems Python modules cannot be unloaded, which would mean that this issue cannot be solved 🙁
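
A minimal sketch of the fix described here, under the assumption that spider locations can be stored as plain strings and re-imported on demand; LazySpiderLoader, myproject.spiders.spider0 and Spider0Spider are hypothetical names, not an accepted Scrapy patch.

from importlib import import_module


class LazySpiderLoader:
    def __init__(self, locations):
        # e.g. {'spider0': ('myproject.spiders.spider0', 'Spider0Spider')}  (placeholder names)
        self._locations = dict(locations)

    def list(self):
        return list(self._locations)

    def load(self, spider_name):
        module_path, class_name = self._locations[spider_name]
        module = import_module(module_path)  # import only this spider's module
        return getattr(module, class_name)

The open questions from the comments above remain: how to build the name-to-location mapping without importing every module in the first place, and whether modules imported this way can actually be released again afterwards.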