OpenSSL unsafe legacy renegotiation disabled error in Scrapy
Explanation of the problem
The following description highlights an issue with the SSL connection to a website that is crawled with the Scrapy framework. The TLS handshake fails with an error, so the expected behavior of loading the HTML page does not occur.
The issue can be reproduced 100% of the time by running the following command in the Scrapy shell:
scrapy shell https://dorotheum.com
The system configuration for the Scrapy framework is as follows:
- Scrapy version: 2.6.1
- lxml version: 4.8.0.0
- libxml2 version: 2.9.4
- cssselect version: 1.1.0
- parsel version: 1.6.0
- w3lib version: 1.22.0
- Twisted version: 22.4.0
- Python version: 3.9.12
- pyOpenSSL version: 22.0.0 (OpenSSL 3.0.3 3 May 2022)
- cryptography version: 37.0.2
- Platform: macOS-12.2.1-arm64-arm-64bit
The error encountered is the following:
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'unsafe legacy renegotiation disabled')]>]
Problem solution for OpenSSL unsafe legacy renegotiation disabled error in Scrapy
The issue at hand is a compatibility problem between the target server and the OpenSSL library that Scrapy uses through pyOpenSSL and the cryptography library. In the failing setup, cryptography 37.0.2 brings OpenSSL 3.0.3. OpenSSL 3 by default refuses to connect to servers that do not support secure renegotiation (RFC 5746), whereas OpenSSL 1.1.1 still allowed such connections. As a result, the TLS handshake fails before Scrapy can download any content.
One solution is to downgrade the cryptography library to version 36.0.2 (pip install cryptography==36.0.2), which is built against OpenSSL 1.1.1n. Scrapy 2.6.1 works well with this combination, and content downloads again without the error. The resulting configuration looks like this:
Scrapy : 2.6.1
lxml : 4.8.0.0
libxml2 : 2.9.4
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 22.4.0
Python : 3.9.12 (main, Mar 26 2022, 15:44:31) - [Clang 13.1.6 (clang-1316.0.21.2)]
pyOpenSSL : 22.0.0 (OpenSSL 1.1.1n 15 Mar 2022)
cryptography : 36.0.2
Platform : macOS-12.3.1-arm64-arm-64bit
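The failing and working listings differ in the OpenSSL build (3.0.3 versus 1.1.1n), so it can help to check which one an environment actually links against. The following is a minimal sketch, not part of Scrapy: the helper names `openssl_major_version` and `legacy_renegotiation_disabled_by_default` are hypothetical, and the check simply assumes that every OpenSSL 3.x build rejects unsafe legacy renegotiation by default.

```python
import re


def openssl_major_version(version_text):
    """Extract the major version from a string like 'OpenSSL 3.0.3 3 May 2022'."""
    match = re.search(r"OpenSSL\s+(\d+)\.", version_text)
    if match is None:
        raise ValueError(f"Unrecognized OpenSSL version string: {version_text!r}")
    return int(match.group(1))


def legacy_renegotiation_disabled_by_default(version_text):
    """Assumption: OpenSSL 3.0 and later reject unsafe legacy renegotiation."""
    return openssl_major_version(version_text) >= 3


if __name__ == "__main__":
    try:
        # pyOpenSSL is the TLS layer that Scrapy and Twisted use
        from OpenSSL import SSL
    except ImportError:
        print("pyOpenSSL is not installed in this environment")
    else:
        linked = SSL.SSLeay_version(SSL.SSLEAY_VERSION).decode()
        print(linked, "->", legacy_renegotiation_disabled_by_default(linked))
```

The same version string also appears in the pyOpenSSL line of the output of scrapy version -v.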
Another solution is to create a custom context factory, a small piece of code that controls how Scrapy sets up TLS connections. This keeps the newer cryptography and OpenSSL versions and relaxes the renegotiation check instead.
To implement it, create a new Python file, contextfactory.py, inside your Scrapy project. In it, subclass ScrapyClientContextFactory and override the getContext() method so that it sets option 0x4 on the TLS context. That value is OpenSSL's SSL_OP_LEGACY_SERVER_CONNECT flag, which allows connections to servers that do not support secure renegotiation.
Once the factory exists, point the DOWNLOADER_CLIENTCONTEXTFACTORY setting at it, either in the project's settings.py or in the custom_settings of the spider in my_spider.py. The contents of contextfactory.py are:
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

class LegacyConnectContextFactory(ScrapyClientContextFactory):
    def getContext(self, hostname=None, port=None):
        ctx = self.getCertificateOptions().getContext()
        # 0x4 is OpenSSL's SSL_OP_LEGACY_SERVER_CONNECT flag
        ctx.set_options(0x4)
        return ctx
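To activate the factory, the DOWNLOADER_CLIENTCONTEXTFACTORY setting has to contain the dotted import path of the class. A minimal sketch, assuming a hypothetical project package called `myproject` that contains contextfactory.py; adjust the path to your own layout:

```python
# settings.py of a hypothetical project package named "myproject".
# The value is the dotted import path of the custom context factory class.
DOWNLOADER_CLIENTCONTEXTFACTORY = (
    "myproject.contextfactory.LegacyConnectContextFactory"
)
```

Because the flag re-enables a connection mode that OpenSSL 3 turns off by default, it may be preferable to set the same key only in the custom_settings dictionary of the spider that needs it, rather than project-wide.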
Other popular problems with Scrapy
Problem: Memory Leaks with Twisted
Another common problem with Scrapy is memory leaks in long-running crawls. Scrapy is built on Twisted, an event-driven networking engine, and memory can grow steadily when callbacks or other long-lived objects keep references to requests, responses, or items that are no longer needed.
Solution:
To resolve this issue, it is recommended to use the latest version of Twisted and monitor memory usage regularly. The following code block shows how to check the current version of Twisted:
pip show Twisted
If the version of Twisted is outdated, it can be updated with the following code block:
pip install Twisted --upgrade
Additionally, it is recommended to use a memory profiling tool, such as memory_profiler, to monitor memory usage and identify potential leaks, as shown in the following code block:
pip install memory_profiler
Problem: Parsing Issues with lxml
Another common problem with Scrapy is parsing issues when using lxml. lxml is a required package for Scrapy, but different versions of lxml can parse the same markup differently or cause compatibility issues with Scrapy.
Solution:
To resolve this issue, it is recommended to use the latest version of lxml and monitor the parsing results regularly. The following code block shows how to check the current version of lxml:
pip show lxml
If the version of lxml is outdated, it can be updated with the following code block:
pip install lxml --upgrade
Additionally, parsel, the selector library that Scrapy itself uses, can be run on its own to test selectors against sample HTML and identify potential issues. It is installed together with Scrapy, but it can also be installed separately:
pip install parsel
Problem: Inconsistent/Inaccurate Data Extraction
Scrapy can be used to extract data from websites and other sources, but there are situations where the data being extracted may be inconsistent or inaccurate. This can be due to the structure of the website or the data itself.
Solution:
Create a custom Item Loader class in Scrapy that can be used to process and clean the extracted data before it is passed to the item pipeline. The custom Item Loader class can be used to define custom data processing methods that can be used to clean and validate the data.
For example, the following code shows a custom Item Loader class that can be used to clean and validate data:
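from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

def process_name(value):
    return value.title()  # capitalize the first letter of each word

class MyItemLoader(ItemLoader):
    # Strip whitespace on input; keep only the first value on output
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()
    # Field-specific input processor for a field called "name"
    name_in = MapCompose(str.strip, process_name)
In this example, the custom Item Loader class uses default_input_processor to strip whitespace from the extracted data and default_output_processor to take only the first value. The name_in attribute attaches process_name to the input of the name field, so values for that field are stripped and then capitalized word by word. The processors are imported from the itemloaders package, because the older scrapy.loader.processors module is deprecated.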
A brief introduction to Scrapy
Scrapy is an open-source and collaborative web crawling framework for Python. It provides a simple way for developers to extract the data they need from websites, by defining rules for the data they want to scrape. This framework is designed to be scalable, fast, and flexible, allowing developers to easily integrate it into their projects and automate the process of scraping.
Scrapy is built on top of the Twisted asynchronous networking framework, which provides high-performance and low-latency web crawling capabilities. This framework allows developers to run multiple requests simultaneously, which enables them to crawl large numbers of pages quickly and efficiently. Additionally, Scrapy includes built-in support for handling common web scraping tasks, such as logging, parsing, and data storage, which reduces the amount of custom code that developers need to write. With its well-documented API, customizable settings, and powerful data extraction capabilities, Scrapy is a popular choice for many web scraping projects.
Most popular use cases for Scrapy
- Web Scraping and Data Collection: Scrapy can be used for extracting and collecting data from websites, either for personal or commercial purposes. It can handle large volumes of data and store it in a structured format, such as CSV, JSON or XML, for further analysis.
- Automating Tasks: Scrapy can be used to automate repetitive and time-consuming tasks, such as monitoring websites for updates, collecting data on a schedule, and sending notifications based on specific criteria. The following code block is an example of using Scrapy to monitor a website for updates, and sending an email notification when a change is detected:
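import scrapy
from scrapy.exceptions import CloseSpider

def send_email(to, subject, body):
    # Placeholder: replace with your own mail code (for example scrapy.mail.MailSender)
    print('To: {}\nSubject: {}\n\n{}'.format(to, subject, body))

class MonitorSpider(scrapy.Spider):
    name = "monitor"
    start_urls = [
        'http://www.example.com/page',
    ]
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
    }
    # Content seen on the previous run; load it from a file or database in real use
    prev_content = None

    def parse(self, response):
        # Compare the current page content to the previous version
        if response.text != self.prev_content:
            self.prev_content = response.text
            # Send an email notification of the change
            send_email(to='you@example.com', subject='Website Changed',
                       body='The website has changed: {}'.format(response.url))
            raise CloseSpider('Website changed')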
- Web crawling: Scrapy can be used to automate the process of crawling websites and extracting data from them. It can follow links from one page to another and extract data from multiple pages in a structured manner. Scrapy spiders can be set up to run at regular intervals, making it possible to regularly update a database of information from the web.
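The monitoring use case above depends on remembering what the page looked like on the previous run. One way to do that, sketched here with only the standard library, is to store a hash of the page body in a file and compare it on the next run. `content_changed` and the state-file path are hypothetical names chosen for illustration; inside a spider the text would come from response.text.

```python
import hashlib
from pathlib import Path


def content_changed(text, state_file):
    """Return True if text differs from what was seen on the previous call.

    The SHA-256 digest of the last seen text is kept in state_file, so the
    comparison survives across separate crawler runs.
    """
    state_path = Path(state_file)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    previous = state_path.read_text() if state_path.exists() else None
    state_path.write_text(digest)
    # The first run has nothing to compare against, so it reports no change
    return previous is not None and previous != digest


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "page.sha256"
        print(content_changed("<html>v1</html>", state))  # first run
        print(content_changed("<html>v1</html>", state))  # same content again
        print(content_changed("<html>v2</html>", state))  # modified content
```

Storing only a digest keeps the state file small. The trade-off is that the helper can report that a page changed but not what changed.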