selector create_root_node memory issues
I have concerns about the memory efficiency of the `create_root_node` function used to create the root `lxml.etree` document for a `Selector`:
https://github.com/scrapy/parsel/blob/7ed4b24a9b8ef874c644c6cec01654539bc66cc3/parsel/selector.py#L47-L55
And especially this line:
https://github.com/scrapy/parsel/blob/7ed4b24a9b8ef874c644c6cec01654539bc66cc3/parsel/selector.py#L50
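For context, at that commit the function looks roughly like this (a paraphrased sketch, not a verbatim copy of parsel's code):

```python
from lxml import etree

def create_root_node(text, parser_cls, base_url=None):
    # Paraphrased sketch of the linked function.
    # The problematic line: it strips the document, removes NUL characters,
    # and re-encodes the whole str back to bytes, so a large input can be
    # copied several more times before parsing even starts.
    body = text.strip().replace('\x00', '').encode('utf8')
    parser = parser_cls(recover=True, encoding='utf8')
    return etree.fromstring(body, parser=parser, base_url=base_url)
```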
Steps to reproduce
I made a scrapy spider (tested on the Zyte Scrapy Cloud stack, `scrapy:2.4`). It includes:
- `get_virtual_size` to report the amount of allocated memory (the same approach as Scrapy's `MemoryUsage` extension)
- the problematic code line `input.strip().replace('\x00', '').encode('utf8')`, run from the `start_requests` method

This is enough to test this specific issue.

Spider code:
```python
import sys
from importlib import import_module

import scrapy


class MemoryCheckSpider(scrapy.Spider):
    name = 'parsel_memory'

    def get_virtual_size(self):
        size = self.resource.getrusage(self.resource.RUSAGE_SELF).ru_maxrss
        if sys.platform != 'darwin':
            # on macOS ru_maxrss is in bytes, on Linux it is in KB
            size *= 1024
        return size

    def check_memory_usage(self, i):
        self.logger.info(f"used memory on start: {str(self.get_virtual_size())}")
        input = i * 1000 * 1000 * 10  # should be ~95 megabytes or ~100 000 000 bytes
        self.logger.info(f"size of input: {str(sys.getsizeof(input))}")
        self.logger.info(f"used memory after input: {str(self.get_virtual_size())}")
        output = input.strip().replace('\x00', '').encode('utf8')  # <- checking this code line
        self.logger.info(f"used memory after output: {str(self.get_virtual_size())}")

    def start_requests(self):
        try:
            self.resource = import_module('resource')
        except ImportError:
            pass
        self.check_memory_usage('ten__bytes')
```
`self.check_memory_usage('ten__bytes')`:
```
2021-02-17 15:43:27 INFO [parsel_memory] used memory on start: 61562880
2021-02-17 15:43:27 INFO [parsel_memory] size of input: 100000049
2021-02-17 15:43:27 INFO [parsel_memory] used memory after input: 169148416
2021-02-17 15:43:27 INFO [parsel_memory] used memory after output: 259026944
```
In this case the input contains no whitespace for `.strip()` and no `\x00` for `.replace()`, so both functions returned the original input unchanged. `.encode('utf8')` then converted it to bytes, which (just like the `body` variable in the original `create_root_node` function) allocated one extra copy's worth of memory.
`self.check_memory_usage(' ten_bytes')`:
```
2021-02-17 15:45:33 INFO [parsel_memory] used memory on start: 61681664
2021-02-17 15:45:33 INFO [parsel_memory] size of input: 100000049
2021-02-17 15:45:33 INFO [parsel_memory] used memory after input: 169275392
2021-02-17 15:45:34 INFO [parsel_memory] used memory after output: 359186432
```
The new input starts with a space, so `strip()` created a new `str`. For immutable types like `str` or `bytes`, every operation whose result differs from the input produces a new object and requires a new memory allocation. As a result, the allocated memory increased by roughly one extra input size compared to the previous (`ten__bytes`) check.
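This can be checked in a REPL. On CPython, `strip()` and `replace()` return the original object when there is nothing to change (an implementation detail, not a language guarantee), while `encode()` always allocates a new bytes object:

```
>>> s = 'ten__bytes' * 1000
>>> s.strip() is s              # nothing to strip: same object on CPython
True
>>> s.replace('\x00', '') is s  # nothing to replace: same object on CPython
True
>>> s.encode('utf8') is s       # encode always produces a new bytes object
False
>>> t = ' ' + s
>>> t.strip() is t              # leading space: a new str is allocated
False
```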
An additional note about `replace('\x00', '')`: when Scrapy creates the `response.text` str from the `response.body` bytes in order to create the `Selector` object, in practice that converted str takes around ~2x the memory of `response.body`. Converting bytes to a str naively with `str()` is even worse, because it produces the textual repr and expands a single byte like `\x00` into a 4-character escape sequence:
```
>>> len(b'\x00')
1
>>> len('\x00')
1
>>> len(str(b'\x00'))
7
>>> len(str(b'\x00\x00'))
11
```
And of course it also affects memory usage:
```
>>> from sys import getsizeof
>>> getsizeof(b'\x00'*1000)
1033
>>> getsizeof(str(b'\x00'*1000))
4052
```
However, decoding with an explicit encoding (which is closer to what Scrapy actually does when building `response.text`) keeps the size much closer to the original bytes:
```
>>> getsizeof(str(b'\x00'*1000, encoding='utf8'))
1049
```
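For completeness, the ~2x figure for `response.text` comes from CPython's flexible string representation: after a real decode, ASCII characters take 1 byte each, most other characters 2 bytes, and characters outside the BMP 4 bytes. Since a single non-Latin-1 character forces the whole str into the wider representation, a mostly-ASCII HTML page with some non-ASCII text ends up roughly twice the size of its UTF-8 bytes. The exact `getsizeof` results below are from one CPython 3.x build and vary slightly between versions:

```
>>> from sys import getsizeof
>>> getsizeof('a' * 1000)            # ASCII: ~1 byte per character
1049
>>> getsizeof('я' * 1000)            # non-Latin-1 (e.g. Cyrillic): ~2 bytes per character
2074
>>> getsizeof('\U0001F600' * 1000)   # outside the BMP: ~4 bytes per character
4076
```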
Taking into account that `response.body` can be around ~5x the original response size after decompression in the httpcompression middleware, this means that the `response.text` used for creating the selector (the `text` passed to the `create_root_node` function) is a str whose memory size is around ~10 times the original HTTP response. And finally, the bytes output of `replace('\x00', '').encode('utf8')` that is used to call `lxml.etree.fromstring` looks just like the original `response.body` bytes right after the httpcompression middleware.
Is it really necessary to convert the `response.body` bytes to str (when creating the `Selector` object) and then convert them back to bytes to create the `etree` inside `create_root_node`? The `Selector` object should probably accept bytes as input and call `lxml.etree.fromstring` directly with the `response.body` argument, without the additional bytes-to-str and str-to-bytes conversions and their memory-intensive consequences.
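A minimal sketch of what that direct path could look like, assuming lxml's recovering HTML parser is left to handle the encoding itself (this is not the current parsel API, just an illustration; `root_from_bytes` is a hypothetical helper name):

```python
from lxml import etree

def root_from_bytes(body, encoding='utf8', base_url=None):
    # Hypothetical helper: parse the response body bytes directly,
    # with no bytes -> str -> bytes round trip.
    parser = etree.HTMLParser(recover=True, encoding=encoding)
    return etree.fromstring(body or b'<html/>', parser=parser, base_url=base_url)

# Usage sketch: root = root_from_bytes(response.body, encoding=response.encoding)
```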
Top GitHub Comments
@pawelmhm I think the real impact of this is much more than 20%.
At this moment I have the following estimations of allocated memory per Response(body) in Scrapy:
1. from the download handler (received from the twisted side): ~1r (bytes) (1 original response body size, or 1r)
2. from the DownloaderStats issue https://github.com/scrapy/scrapy/issues/4964: ~1r (bytes), ~2r in total
3. after the httpcompression downloadermiddleware https://github.com/scrapy/scrapy/issues/4797 (accessible from spider parse methods as `response.body`): ~5r (bytes), ~7r in total
4. when creating `response.text` from `response.body` for creating the selector: ~10r (str), ~17r in total
5. create root node, after `.strip()`: ~10r (str), ~27r in total
6. create root node, after `replace('\x00', '').encode('utf8')`: ~5r (bytes), ~32r in total

The result of step 6 is what is used for creating the root node. In total, to process a single HTML response (in the worst case), it is required to allocate ~32 times more memory than the original response body.
If we instead use the result of step 3 to create the root node, we can reduce the amount of allocated memory from 32r (22r) to 7r (to 6r after a fix of https://github.com/scrapy/scrapy/issues/4964).
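A quick tally of those estimates (the step sizes are the rough figures above, in units of the original response body size r):

```python
# Rough bookkeeping of the per-response allocation estimates above (units: r).
steps = [
    ("download handler (twisted)", 1),
    ("DownloaderStats copy", 1),
    ("httpcompression decompressed body", 5),
    ("response.text str", 10),
    ("create_root_node .strip() copy", 10),
    ("create_root_node .replace().encode('utf8')", 5),
]

total = 0
for name, size in steps:
    total += size
    print(f"{name}: +{size}r -> {total}r in total")  # ends at 32r
```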
I see, this makes sense. Currently the `MemoryUsage` extension checks peak memory every N seconds, and stops the spider if the peak was larger than a specified amount.
The idea behind the extension is to prevent OOMs and to stop gracefully. For example, if the hard memory limit is 5GB, Scrapy's limit may be set to 4GB. Let's say Scrapy's consumption grows linearly with the number of requests scheduled (e.g. because of the dupefilter), and there are regular 1.5GB spikes caused by temporary allocations. If the extension measured current memory usage, it is unlikely it would see a spike, so it would let memory grow beyond 3.5GB, and then a spike might kill the process. However, if you use max memory, as Scrapy's extension does, the spider is stopped effectively at 2.5GB, when a spike happens, preventing the OOM.
Obviously, this is not a silver bullet. The next spike can be larger than the spikes seen before, and the buffer between the hard limit and the Scrapy limit may not be enough to account for it. It also has an efficiency issue: we don't use some of the memory that could have been used, in order to keep a "buffer" for future spikes. If you're really, really tight on memory, it might be possible to tune the extension's behavior to better match the behavior of your spider. In some cases it may be OK to disable it completely, if you know you can survive the data issues caused by OOMs and need more memory.
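For reference, a sketch of tuning (or disabling) the extension along those lines via Scrapy's built-in memory usage settings:

```python
# settings.py: tuning the MemoryUsage extension as described above
# (example numbers, assuming a ~5GB hard limit on the host).
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 4096                # stop gracefully well below the hard limit
MEMUSAGE_WARNING_MB = 3072              # warn earlier to leave room for temporary spikes
MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0  # how often peak memory is checked

# Or, if you can tolerate occasional OOM kills and need the extra memory:
# MEMUSAGE_ENABLED = False
```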
A good point. It might be possible to add an optimization for this case, though the implementation and testing may be a bit tricky (how do we ensure that we won't break it in the future?). As for the implementation, focusing on removing the str -> bytes conversion looks easier to me than trying to remove the bytes -> str conversion, as Scrapy may be computing `response.text` anyway, e.g. because some middleware uses it.