S3FilesStore can use a lot of memory
Hi,
@nramirezuy and I were debugging a memory issue with one of the spiders some time ago, and it seems to be caused by ImagesPipeline + S3FilesStore. I haven’t confirmed that this was the cause of the memory issue; this ticket is based solely on reading the source code.
FilesPipeline reads the whole file into memory and then defers the upload to a thread (via S3FilesStore.persist_file, passing the file contents as bytes). So many files can be loaded into memory at the same time, and as soon as files are downloaded faster than they are uploaded to S3, memory usage will grow. This is not unlikely IMHO, because S3 is not super fast. For ImagesPipeline it is worse, because it uploads not only the image itself but also the generated thumbnails.
I think S3FilesStore should persist files to a temporary location before uploading them to S3 (at least optionally). This would allow streaming files without holding them in memory.
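To illustrate the proposal, here is a minimal hypothetical sketch (not Scrapy's actual implementation): the downloaded body is written chunk by chunk to a spooled temp file, which is then handed to an upload callable (e.g. a boto3 `upload_fileobj` call) so that the whole payload never has to sit in memory at once. The names `persist_via_tempfile` and `upload` are illustrative.

```python
# Hypothetical sketch: spool a downloaded body to a temporary file and
# stream it to S3, instead of keeping the full payload in memory.
import tempfile

def persist_via_tempfile(chunks, upload, spool_limit=1024 * 1024):
    """Write an iterable of byte chunks to a spooled temp file, rewind it,
    and pass the file object to `upload` (e.g. a boto3 upload_fileobj
    call). Once the spool rolls over to disk, only one chunk at a time
    is held in memory. Returns the total number of bytes written."""
    with tempfile.SpooledTemporaryFile(max_size=spool_limit) as f:
        total = 0
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
        f.seek(0)  # rewind so the uploader reads from the start
        upload(f)
        return total
```

With boto3 this would be invoked roughly as `persist_via_tempfile(chunks, lambda f: client.upload_fileobj(f, bucket, key))`, since `upload_fileobj` streams the file object in multipart chunks.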
Issue Analytics
- Created: 10 years ago
- Comments: 12 (7 by maintainers)
Top GitHub Comments
Hey. It seems this CVE is popping up everywhere and can cause some warnings for users, so we should do something about it.
Let’s investigate possible solutions. https://nvd.nist.gov/vuln/detail/CVE-2017-14158 says that the issue is an instance of https://cwe.mitre.org/data/definitions/400.html. Description of CWE-400:
Scrapy controls the number and size of the resources in the following way:
Reducing the amount of RAM used, e.g. by storing more data to disk, is not a solution for the CVE. By doing so we’re only shifting the consumed resource from “memory usage” to “disk usage”, without addressing the security issue.
It seems that an option similar to SCRAPER_SLOT_MAX_ACTIVE_SIZE, but working at the Downloader level or at the Engine level, should solve the problem. I.e. limit, in a soft way (if we’re over the limit, stop accepting new work), the total byte size of responses being processed by all Scrapy components. What do you think?
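The soft limit described above could be sketched roughly as follows. This is a hypothetical illustration, not Scrapy code: the class name, method names, and the `max_active_size` default are all made up for the example. The key property of a "soft" limit is that work already admitted keeps running; the engine merely refuses to start new downloads while the total in-flight byte size is over the threshold.

```python
# Hypothetical sketch of a soft limit on the total byte size of
# responses in flight, in the spirit of SCRAPER_SLOT_MAX_ACTIVE_SIZE
# but applied at the Downloader/Engine level.
class SoftSizeLimit:
    def __init__(self, max_active_size=5_000_000):
        self.max_active_size = max_active_size
        self.active_size = 0  # total bytes of responses currently in flight

    def needs_backout(self):
        # Soft limit: never cancel admitted work, only stop accepting
        # new downloads while we are at or over the threshold.
        return self.active_size >= self.max_active_size

    def add_response(self, size):
        # Called when a response enters processing.
        self.active_size += size

    def remove_response(self, size):
        # Called when a response leaves processing (e.g. fully persisted).
        self.active_size -= size
```

An engine loop would then poll `needs_backout()` before scheduling the next request, so memory pressure translates into back-pressure on the downloader rather than unbounded growth.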
Hi, is there a plan to fix this? I have seen this vulnerability for a while, but I don’t see a clear solution for it. Thanks!