Scrapy under Python 3 is slower than under Python 2

The bookworm benchmark from https://github.com/scrapy/scrapy-bench/ (see also https://medium.com/@vermaparth/parth-gsoc-f5556ffa4025) shows about a 15% slowdown, while the more synthetic scrapy bench shows a 2x slowdown: https://github.com/scrapy/scrapy/pull/3050#issuecomment-353863711

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 22 (22 by maintainers)

Top GitHub Comments

2 reactions
lopuhin commented, Feb 25, 2018

I would start by comparing profiles (that is, running under a profiler) on Py2.7 and Py3.6 for several benchmarks and trying to spot where most of the difference comes from. It makes sense to check several benchmarks, because some might show more of a difference and some will be easier to analyze. For profilers, if you already have a preference, go with it. If not, I would suggest the built-in cProfile with a visualization backend (e.g. snakeviz), plus vmprof with vmprof.com for visualization. It's good to have several different profilers because this lets you cross-check the profiling results.
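As a rough illustration of the cProfile suggestion, here is a minimal sketch written for Python 3. The `parse_links` workload and the `profile_call` helper are hypothetical stand-ins, not part of Scrapy or scrapy-bench; in practice the profiled callable would be whatever benchmark is being compared.

```python
import cProfile
import io
import pstats


def parse_links(n_pages):
    """Hypothetical stand-in workload; swap in the benchmark under test."""
    total = 0
    for i in range(n_pages):
        # Splitting on the quote character yields three pieces per link.
        total += len(("<a href='/page/%d'></a>" % i).split("'"))
    return total


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


result, report = profile_call(parse_links, 10000)
print(report)
```

Calling `profiler.dump_stats(...)` with a file name instead of printing would save the raw profile, which a viewer such as snakeviz can then open. Collecting one such profile per interpreter gives two reports to compare side by side.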

1 reaction
nctl144 commented, Mar 13, 2018

I meant that the difference in pages/min is 2x (4320 vs 2040), while the difference in total pages crawled is about 30% (653 vs 421).

I ran scrapy bench repeatedly and the results vary a lot from run to run. However, the speed difference between Python 2 and Python 3 is still really high (about 30%-40%). The figure I copied from the last line of the log also looked odd, so I added a report line that divides the pages crawled by the elapsed time, then ran the benchmark again and got this result:

Python 2:

2018-03-13 02:11:58 [scrapy.extensions.logstats] INFO: Crawled 605 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
('the result is', 621, 'pages and', 3542.474324191734, 'pages/min over', datetime.timedelta(0, 10, 518072))

Python 3:

2018-03-13 02:13:38 [scrapy.extensions.logstats] INFO: Crawled 405 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
the result is 420 pages and 2325.8044029876987 pages/min over 0:00:10.834961
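The extra report line appears to be pages crawled divided by elapsed time, scaled to minutes. A minimal sketch of that arithmetic, using the page counts and timedeltas from the two report lines above (the function name `pages_per_minute` is an assumption for illustration, not the code that was actually added to the benchmark):

```python
from datetime import timedelta


def pages_per_minute(pages, elapsed):
    """Average crawl rate over the whole run, in pages per minute."""
    return pages / elapsed.total_seconds() * 60


# Figures taken from the two report lines in the logs.
py2_rate = pages_per_minute(621, timedelta(0, 10, 518072))
py3_rate = pages_per_minute(420, timedelta(0, 10, 834961))

print("Python 2: %.2f pages/min" % py2_rate)
print("Python 3: %.2f pages/min" % py3_rate)
print("Python 3 / Python 2: %.2f" % (py3_rate / py2_rate))
```

This reproduces the pages/min figures in the report lines to within rounding, and it averages over the full run rather than over the last logstats interval only.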

As you can see, the result is basically the same. But I noticed that the spider slows down considerably over time. I will take the log on Python 3 as an example:

2018-03-13 02:13:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:29 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 3600 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:30 [scrapy.extensions.logstats] INFO: Crawled 101 pages (at 2460 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:31 [scrapy.extensions.logstats] INFO: Crawled 156 pages (at 3300 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:32 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:33 [scrapy.extensions.logstats] INFO: Crawled 236 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:34 [scrapy.extensions.logstats] INFO: Crawled 269 pages (at 1980 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:35 [scrapy.extensions.logstats] INFO: Crawled 309 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:36 [scrapy.extensions.logstats] INFO: Crawled 340 pages (at 1860 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:37 [scrapy.extensions.logstats] INFO: Crawled 373 pages (at 1980 pages/min), scraped 0 items (at 0 items/min)

This also happens on Python 2, but to a lesser extent. I guess that's where the difference comes from.
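To put a number on the slowdown within a run, the logstats lines can be parsed and the early intervals compared with the late ones. The sketch below is one way to do that, assuming the log format shown above; the regular expression and the helper names are illustrative and not part of Scrapy.

```python
import re

# Matches the "Crawled N pages (at R pages/min)" part of a logstats line.
LOGSTATS_RE = re.compile(r"Crawled (\d+) pages \(at (\d+) pages/min\)")

LOG = """\
2018-03-13 02:13:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:29 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 3600 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:30 [scrapy.extensions.logstats] INFO: Crawled 101 pages (at 2460 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:31 [scrapy.extensions.logstats] INFO: Crawled 156 pages (at 3300 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:32 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:33 [scrapy.extensions.logstats] INFO: Crawled 236 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:34 [scrapy.extensions.logstats] INFO: Crawled 269 pages (at 1980 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:35 [scrapy.extensions.logstats] INFO: Crawled 309 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:36 [scrapy.extensions.logstats] INFO: Crawled 340 pages (at 1860 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:37 [scrapy.extensions.logstats] INFO: Crawled 373 pages (at 1980 pages/min), scraped 0 items (at 0 items/min)
"""


def parse_logstats(text):
    """Return (total_pages, pages_per_min) for every logstats line found."""
    return [(int(pages), int(rate)) for pages, rate in LOGSTATS_RE.findall(text)]


def mean(values):
    return sum(values) / len(values)


samples = parse_logstats(LOG)
rates = [rate for _, rate in samples[1:]]  # skip the initial 0-page line
early, late = rates[:3], rates[-3:]

print("first 3 intervals: %.0f pages/min" % mean(early))
print("last 3 intervals:  %.0f pages/min" % mean(late))
```

For the Python 3 log above, this gives an average of 3120 pages/min over the first three intervals and 2080 pages/min over the last three. Running the same comparison on a Python 2 log would show whether the drop there is smaller.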
