
selector create_root_node memory issues

See original GitHub issue

I have concerns about the memory efficiency of the create_root_node function used to create the root lxml.etree for Selector:

https://github.com/scrapy/parsel/blob/7ed4b24a9b8ef874c644c6cec01654539bc66cc3/parsel/selector.py#L47-L55

And especially this line: https://github.com/scrapy/parsel/blob/7ed4b24a9b8ef874c644c6cec01654539bc66cc3/parsel/selector.py#L50
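For context, at that pinned commit the function looks approximately like the sketch below (the chained-operations line is the one quoted in this issue; the surrounding lines are paraphrased and may differ slightly from the linked source):

from lxml import etree

def create_root_node(text, parser_cls, base_url=None):
    """Create root node for text using given parser class."""
    # the problematic line: .strip(), .replace() and .encode() can each
    # allocate a full copy of a very large input string
    body = text.strip().replace('\x00', '').encode('utf8')
    parser = parser_cls(recover=True, encoding='utf8')
    root = etree.fromstring(body, parser=parser, base_url=base_url)
    if root is None:
        root = etree.fromstring(b'<html/>', parser=parser, base_url=base_url)
    return root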

Steps to reproduce: I made a Scrapy spider (tested on the Zyte Scrapy Cloud stack, scrapy:2.4). It includes:

  • get_virtual_size to report the amount of allocated memory (the same approach as in Scrapy's MemoryUsage extension)
  • the problematic code line input.strip().replace('\x00', '').encode('utf8') inside the start_requests method - it is enough to test this specific issue
spider code
import sys
from importlib import import_module
import scrapy

class MemoryCheckSpider(scrapy.Spider):
    name = 'parsel_memory'

    def get_virtual_size(self):
        size = self.resource.getrusage(self.resource.RUSAGE_SELF).ru_maxrss
        if sys.platform != 'darwin':
            # on macOS ru_maxrss is in bytes, on Linux it is in KB
            size *= 1024
        return size

    def check_memory_usage(self, i):
        self.logger.info(f"used memory on start: {str(self.get_virtual_size())}")
        input = i*1000*1000*10 #should be ~95 megabytes or ~100 000 000 bytes
        self.logger.info(f"size of input: {str(sys.getsizeof(input))}")
        self.logger.info(f"used memory after input: {str(self.get_virtual_size())}")
        output = input.strip().replace('\x00', '').encode('utf8')  # < - checking this code line
        self.logger.info(f"used memory after output: {str(self.get_virtual_size())}")

    def start_requests(self):
        try:
            self.resource = import_module('resource')
        except ImportError:
            pass
        self.check_memory_usage('ten__bytes')
        return []  # no requests needed for this check; avoids "NoneType is not iterable" from Scrapy

self.check_memory_usage('ten__bytes')

9:	2021-02-17 15:43:27	INFO	[parsel_memory] used memory on start: 61562880
10:	2021-02-17 15:43:27	INFO	[parsel_memory] size of input: 100000049
11:	2021-02-17 15:43:27	INFO	[parsel_memory] used memory after input: 169148416
12:	2021-02-17 15:43:27	INFO	[parsel_memory] used memory after output: 259026944

In this case the input has no whitespace for .strip() to remove and no \x00 for .replace() to replace, so both calls returned the original input. .encode('utf8') then converted it to bytes, which in the original create_root_node function is the allocation for its body variable.

self.check_memory_usage(' ten_bytes')

9:	2021-02-17 15:45:33	INFO	[parsel_memory] used memory on start: 61681664
10:	2021-02-17 15:45:33	INFO	[parsel_memory] size of input: 100000049
11:	2021-02-17 15:45:33	INFO	[parsel_memory] used memory after input: 169275392
12:	2021-02-17 15:45:34	INFO	[parsel_memory] used memory after output: 359186432

The new input starts with a space, so strip() created a new str (for immutable types like str or bytes, every operation whose result differs from the input produces a new object and requires a new memory allocation). As a result, the allocated memory increased by roughly one input size compared to the previous (ten__bytes) check.
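The same effect can be reproduced outside Scrapy with tracemalloc (a minimal sketch, independent of the spider above; the exact numbers depend on the CPython version):

import tracemalloc

text = ' ' + 'x' * 10_000_000           # ~10 MB str with a leading space
tracemalloc.start()
stripped = text.strip()                  # input changed -> a new ~10 MB str is allocated
replaced = stripped.replace('\x00', '')  # no NULs -> CPython can return the same object
encoded = replaced.encode('utf8')        # always allocates a new ~10 MB bytes object
current, peak = tracemalloc.get_traced_memory()
print(f"traced: current={current:,} bytes, peak={peak:,} bytes")
tracemalloc.stop()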

Additional note about replace('\x00', '')

When Scrapy creates the response.text str from the response.body bytes to build a Selector object, in practice that converted str has around ~2x the memory size of response.body, because Python converts single-byte symbols like \x00 into 4-byte unicode symbols:

>>> len(b'\x00')
1
>>> len('\x00')
1
>>> len(str(b'\x00'))
7
>>> len(str(b'\x00\x00'))
11

And of course it also affects memory usage:

>>> from sys import getsizeof
>>> getsizeof(b'\x00'*1000)
1033
>>> getsizeof(str(b'\x00'*1000))
4052
However:
>>> getsizeof(str(b'\x00'*1000, encoding='utf8'))
1049

Taking into account that response.body is around ~5x the original response size after decompression in the httpcompression middleware, the response.text used to create the selector (and the text passed to the create_root_node function) is a str with a memory size around ~10x that of the original HTTP response.

And finally, the bytes produced after replace('\x00', '') and .encode('utf8'), which are passed to lxml.etree.fromstring, look just like the original response.body bytes right after the httpcompression middleware.

Is it really necessary to convert the response.body bytes to str (when creating the Selector object) and then convert them back to bytes to create the etree inside create_root_node?

In this case the Selector object should probably accept bytes as input and call lxml.etree.fromstring directly with the response.body argument, without the additional bytes -> str and str -> bytes conversions and their memory-intensive consequences.
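As a rough illustration of the idea (a sketch only: selector_from_bytes is a hypothetical helper, not parsel's API; it assumes a Selector can be built from a pre-parsed root, which parsel exposes via its root argument):

from lxml import etree
from parsel import Selector

def selector_from_bytes(body: bytes, base_url=None) -> Selector:
    # parse the raw response.body bytes directly, skipping the
    # bytes -> str -> bytes round trip described above
    parser = etree.HTMLParser(recover=True, encoding='utf-8')
    # drop NUL bytes as the original code does, but on bytes instead of str
    root = etree.fromstring(body.replace(b'\x00', b''), parser=parser, base_url=base_url)
    return Selector(root=root)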

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments:5 (2 by maintainers)

Top GitHub Comments

1 reaction
GeorgeA92 commented, Feb 18, 2021

@pawelmhm

So just removing this single line gives us 20% improvement in memory usage.

I think the real impact of this is much more than 20%.

At this moment I have the following estimations of allocated memory per Response(body) in Scrapy:

1. from the download handler (received from the Twisted side) - ~1r (bytes) (1 original response body size, or 1r)
2. from the DownloaderStats downloader middleware, https://github.com/scrapy/scrapy/issues/4964 - ~1r (bytes), ~2r in total
3. after the httpcompression downloader middleware, https://github.com/scrapy/scrapy/issues/4797 (accessible from spider parse methods as response.body) - ~5r (bytes), ~7r in total
4. when creating response.text from response.body for creating the selector - ~10r (str), ~17r in total
5. create root node, after .strip() - ~10r (str), ~27r in total
6. create root node, after replace('\x00', '').encode('utf8') - ~5r (bytes), ~32r in total

The result of step 6 is what is used to create the root node. In total, to process a single html response (in the worst case) it is necessary to allocate ~32 times more memory than the original response body.

If we instead use the result of step 3 to create the root node, we can reduce the amount of allocated memory from ~32r (22r) to ~7r (to ~6r after the fix of https://github.com/scrapy/scrapy/issues/4964).
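For bookkeeping, the two totals can be checked by re-adding the multipliers listed above (a sketch only; r is the original response body size):

# multipliers from the six steps above, in units of r
full_path = [1, 1, 5, 10, 10, 5]   # steps 1-6, ending with create_root_node's encoded bytes
bytes_only_path = [1, 1, 5]        # stop at step 3 and parse response.body directly
print(sum(full_path))              # 32 -> ~32r allocated per response in the worst case
print(sum(bytes_only_path))        # 7  -> ~7r if the root node is built from step 3's bytes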

0 reactions
kmike commented, Mar 12, 2021

As far as I understand, the result of this function is used to stop spiders on Scrapy Cloud - this is the main reason for using this method in the tests.

I see, this makes sense. Currently the MemoryUsage extension checks peak memory every N seconds, and stops the spider if the peak was larger than a specified amount.

The idea behind the extension is to prevent OOMs and to stop gracefully. For example, if the hard memory limit is 5GB, Scrapy's limit may be set to 4GB. Let's say Scrapy's consumption grows linearly with the number of requests scheduled (e.g. because of the dupefilter), and there are regular 1.5GB spikes caused by temporary allocations. If the Memory extension were measuring current memory usage, it would be unlikely to see the spike, so it would let memory grow beyond 3.5GB, and then a spike might kill the process. However, if you use max memory, as Scrapy's extension does, the spider is effectively stopped at 2.5GB, when a spike happens, preventing the OOM.

Obviously, this is not a silver bullet. The next spike can be larger than the spikes seen before, and the buffer between the hard limit and the Scrapy limit may not be enough to account for it. It also has an efficiency issue: we leave some memory unused in order to keep a "buffer" for future spikes. If you're really, really tight on memory, it might be possible to tune the extension's behavior to better match the behavior of your spider. In some cases it may be OK to disable it completely, if you know you can survive the data issues caused by OOMs and need the extra memory.
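For reference, this trade-off is tuned through Scrapy's memory-usage settings; the values below are illustrative only, loosely matching the 5GB-hard-limit scenario above:

# settings.py - illustrative values, not a recommendation
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 4096                # stop the spider gracefully below the 5GB hard limit
MEMUSAGE_WARNING_MB = 3584              # log a warning earlier
MEMUSAGE_CHECK_INTERVAL_SECONDS = 60    # how often the extension checks peak memory usage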

In this case the conversion bytes -> str -> bytes is not needed, because the original input is already utf-8.

A good point. It might be possible to add an optimization for this case, though the implementation and testing may be a bit tricky (how do we ensure that we won't break it in the future?). As for the implementation, focusing on removing the str -> bytes conversion looks easier to me than trying to remove the bytes -> str conversion, as Scrapy may be computing response.text anyway, e.g. because some middleware uses it.
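One possible direction for that str-side optimization (an untested sketch, not parsel's code; lxml's handling of documents with explicit encoding declarations is an edge case a real implementation would need to cover):

from lxml import etree

def create_root_node_from_str(text, base_url=None):
    # hypothetical variant: keep the str and let lxml parse unicode directly,
    # dropping the .encode('utf8') copy (the strip/replace copies remain)
    parser = etree.HTMLParser(recover=True)   # note: no encoding= when passing a str
    return etree.fromstring(text.strip().replace('\x00', ''),
                            parser=parser, base_url=base_url)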
