
S3FilesStore can use a lot of memory

See original GitHub issue

Hi,

@nramirezuy and I were debugging a memory issue with one of the spiders some time ago, and it seems to be caused by ImagesPipeline + S3FilesStore. I haven’t confirmed that this was the cause of the memory issue; this ticket is based solely on reading the source code.

FilesPipeline reads the whole file into memory and then defers the upload to a thread (via S3FilesStore.persist_file, passing the file contents as bytes). So many files could be loaded into memory at the same time, and as soon as files are downloaded faster than they are uploaded to S3, memory usage will grow. This is not unlikely IMHO, because S3 is not super-fast. For ImagesPipeline it is worse, because it uploads not only the image itself but also the generated thumbnails.
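To make the failure mode concrete, here is a rough sketch of the pattern described above (simplified and illustrative, not Scrapy’s actual code; the function name and signature are stand-ins):

```python
from io import BytesIO
from twisted.internet import threads

# Simplified sketch of the buffering pattern described above
# (not Scrapy's actual code; names are illustrative).
def file_downloaded(response, store):
    buf = BytesIO(response.body)  # the whole file body is held in RAM
    # deferToThread returns immediately, but buf stays alive until the
    # threaded S3 upload finishes. If downloads outpace uploads,
    # many full file bodies accumulate in memory at once.
    return threads.deferToThread(store.persist_file, response.url, buf)
```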

I think S3FilesStore should persist files to a temporary location before uploading them to S3 (at least optionally). This would allow streaming files without holding them in memory.
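A minimal sketch of that idea, assuming boto3 for the S3 side (the function name, signature, and bucket/key are placeholders, not the actual S3FilesStore API):

```python
import tempfile
import boto3

# Illustrative sketch only: spool the download to a temporary file,
# then stream it to S3, so peak memory is one chunk rather than the
# whole file. Bucket and key are placeholders.
def persist_file_streaming(chunks, bucket, key):
    s3 = boto3.client("s3")
    with tempfile.TemporaryFile() as tmp:
        for chunk in chunks:  # write chunks to disk as they arrive
            tmp.write(chunk)
        tmp.seek(0)
        # upload_fileobj reads from the file object in parts and uses
        # a multipart upload for large files, keeping memory bounded.
        s3.upload_fileobj(tmp, bucket, key)
```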

Issue Analytics

  • State: open
  • Created: 10 years ago
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

2 reactions
kmike commented, Nov 8, 2022

Hey. It seems this CVE is popping up everywhere and can cause warnings for users, so we should do something about it.

Let’s investigate possible solutions. https://nvd.nist.gov/vuln/detail/CVE-2017-14158 says that the issue is an instance of https://cwe.mitre.org/data/definitions/400.html. The description of CWE-400:

The software does not properly control the allocation and maintenance of a limited resource, thereby enabling an actor to influence the amount of resources consumed, eventually leading to the exhaustion of available resources.

Limited resources include memory, file system storage, database connection pool entries, and CPU. If an attacker can trigger the allocation of these limited resources, but the number or size of the resources is not controlled, then the attacker could cause a denial of service that consumes all available resources.

Scrapy controls the number and size of these resources in the following ways:

  1. There is the DOWNLOAD_MAXSIZE option (https://docs.scrapy.org/en/latest/topics/settings.html#download-maxsize), which prevents downloading overly large files.
  2. There are the various CONCURRENT_REQUESTS options, which limit the number of parallel downloads (each download then causes an upload to S3).
  3. There is SCRAPER_SLOT_MAX_ACTIVE_SIZE, a soft limit on the total size of all responses being processed by the scraper (“While the sum of the sizes of all responses being processed is above this value, Scrapy does not process new requests.”). I’m not sure, though, why it is applied at the Scraper level rather than the Downloader level: it seems this option has no effect if a request reaches the downloader without going through the scraper, and requests initiated in MediaPipeline don’t go through the scraper. (Example settings for all three controls follow this list.)
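For reference, these controls live in a project’s settings.py; the values below are the documented defaults, shown purely for illustration:

```python
# settings.py: the resource controls listed above (values are the
# documented defaults, shown here only for illustration).
DOWNLOAD_MAXSIZE = 1073741824           # 1 GiB cap on a single response body
CONCURRENT_REQUESTS = 16                # global cap on parallel downloads
CONCURRENT_REQUESTS_PER_DOMAIN = 8      # per-domain cap
SCRAPER_SLOT_MAX_ACTIVE_SIZE = 5000000  # soft cap on bytes in the scraper
```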

Reducing RAM usage, e.g. by storing more data on disk, is not a solution for the CVE. By doing so we’d be shifting the resource from “memory usage” to “disk usage” without addressing the security issue.

It seems that an option similar to SCRAPER_SLOT_MAX_ACTIVE_SIZE, but working at the Downloader or Engine level, should solve the problem: i.e., limit (in a soft way: if we’re over the limit, stop accepting new work) the total byte size of the responses being processed by all Scrapy components. What do you think?
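As a sketch of the proposed mechanism (hypothetical, not an existing Scrapy API), the soft limit could mirror the scraper slot’s accounting, but at a level that every response passes through:

```python
# Hypothetical sketch of the proposed engine-level soft limit; this is
# not an existing Scrapy API. It mirrors how the scraper slot tracks
# the active response size.
class ActiveSizeLimiter:
    def __init__(self, max_active_size=5_000_000):
        self.max_active_size = max_active_size
        self.active_size = 0

    def needs_backout(self):
        # Soft limit: while over the cap, stop accepting new requests;
        # responses already in flight are still processed to completion.
        return self.active_size >= self.max_active_size

    def response_entered(self, response_size):
        self.active_size += response_size

    def response_left(self, response_size):
        self.active_size -= response_size
```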

1 reaction
genismoreno commented, Nov 8, 2022

Hi, is there a plan to fix this? I have been seeing this vulnerability flagged for a while, but I don’t see a clear solution for it. Thanks!

Read more comments on GitHub >

