Scrapy under Python 3 is slower than under Python 2

The bookworm benchmark from https://github.com/scrapy/scrapy-bench/ (see also https://medium.com/@vermaparth/parth-gsoc-f5556ffa4025) shows about a 15% slowdown under Python 3, while the more synthetic scrapy bench shows a 2x slowdown: https://github.com/scrapy/scrapy/pull/3050#issuecomment-353863711

Issue Analytics

Created 6 years ago
Comments: 22 (22 by maintainers)

Top GitHub Comments

lopuhin commented, Feb 25, 2018 (2 reactions)

I think I would start by comparing profiles (that is, running under a profiler) under Py2.7 and Py3.6 for several benchmarks and trying to spot where most of the difference comes from. For benchmarks, I think it makes sense to check several of them, because some might show a larger difference and some will be easier to analyze. For profilers, if you already have a preference, go with it. If not, I would suggest using the built-in cProfile with a visualization backend (e.g. snakeviz), and vmprof with vmprof.com for visualization. It’s good to have several different profilers because this allows cross-checking the profiling results.
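As a rough illustration of the cProfile route suggested above, here is a minimal sketch that profiles a callable and prints the functions with the highest cumulative time. The `workload` function is a made-up stand-in and not part of Scrapy; in practice you would profile a real benchmark run under each interpreter and compare the two reports.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, profiler)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    return result, profiler


def top_functions(profiler, limit=10):
    """Return a text report of the `limit` entries with the highest cumulative time."""
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(limit)
    return stream.getvalue()


def workload(n=20000):
    # Hypothetical stand-in for a benchmark: str/bytes conversions in a loop.
    return sum(len(str(i).encode("utf-8")) for i in range(n))


if __name__ == "__main__":
    _, prof = profile_call(workload)
    print(top_functions(prof, limit=5))
```

Running the same script under both interpreters and comparing the reports side by side is one way to spot where the time goes. The profiler's `dump_stats()` method can also write the raw data to a file (the file name is up to you) for a viewer such as snakeviz.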

nctl144 commented, Mar 13, 2018 (1 reaction)

I meant that the difference in pages/min is 2x (4320 vs 2040), while the difference in total pages crawled is about 30% (653 vs 421).

I tried running scrapy bench again and again, and the results are really unpredictable. However, the speed difference between Python 2 and Python 3 is still really high (about 30%-40%). But then again, the result I copied from the last line of the log is pretty weird. So I added a report line that divides the pages crawled by the elapsed time, then ran the benchmark again and got this result:

Python2:

2018-03-13 02:11:58 [scrapy.extensions.logstats] INFO: Crawled 605 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
('the result is', 621, 'pages and', 3542.474324191734, 'pages/min over', datetime.timedelta(0, 10, 518072))

Python3:

2018-03-13 02:13:38 [scrapy.extensions.logstats] INFO: Crawled 405 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
the result is 420 pages and 2325.8044029876987 pages/min over 0:00:10.834961
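
The "pages divided by elapsed time" figure in the report lines above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not the code that was actually added to the benchmark; the function name `pages_per_minute` is an assumption made for illustration.

```python
from datetime import timedelta


def pages_per_minute(pages, elapsed):
    """Average crawl rate over the whole run, in pages per minute."""
    seconds = elapsed.total_seconds()
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return pages * 60.0 / seconds


if __name__ == "__main__":
    # Page counts and durations taken from the two report lines in this comment.
    py2 = pages_per_minute(621, timedelta(0, 10, 518072))
    py3 = pages_per_minute(420, timedelta(0, 10, 834961))
    print("Python 2: %.1f pages/min" % py2)
    print("Python 3: %.1f pages/min" % py3)
    print("Python 3 rate relative to Python 2: %.2f" % (py3 / py2))
```

Averaging over the whole run avoids relying on the last logstats line, which only reflects the most recent interval.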

As you can see, the result is basically the same. But I noticed that the spider slows down considerably over time. I will take the Python 3 log as an example:

2018-03-13 02:13:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:29 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 3600 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:30 [scrapy.extensions.logstats] INFO: Crawled 101 pages (at 2460 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:31 [scrapy.extensions.logstats] INFO: Crawled 156 pages (at 3300 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:32 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:33 [scrapy.extensions.logstats] INFO: Crawled 236 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:34 [scrapy.extensions.logstats] INFO: Crawled 269 pages (at 1980 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:35 [scrapy.extensions.logstats] INFO: Crawled 309 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:36 [scrapy.extensions.logstats] INFO: Crawled 340 pages (at 1860 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:37 [scrapy.extensions.logstats] INFO: Crawled 373 pages (at 1980 pages/min), scraped 0 items (at 0 items/min)

This also happens on Python 2, but to a lesser extent. I guess that’s where the difference comes from.
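
One way to make the slowdown over time easier to compare between interpreters is to parse the logstats lines and compute the rate for each interval. The sketch below assumes the log format shown in this thread; the regular expression and the helper names are illustrative and not part of Scrapy.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2018-03-13 02:13:29 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at ...
LOGSTATS_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[scrapy\.extensions\.logstats\] "
    r"INFO: Crawled (\d+) pages"
)

# First four logstats lines of the Python 3 run quoted above.
SAMPLE_LOG = [
    "2018-03-13 02:13:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)",
    "2018-03-13 02:13:29 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 3600 pages/min), scraped 0 items (at 0 items/min)",
    "2018-03-13 02:13:30 [scrapy.extensions.logstats] INFO: Crawled 101 pages (at 2460 pages/min), scraped 0 items (at 0 items/min)",
    "2018-03-13 02:13:31 [scrapy.extensions.logstats] INFO: Crawled 156 pages (at 3300 pages/min), scraped 0 items (at 0 items/min)",
]


def parse_logstats(lines):
    """Extract (timestamp, cumulative pages) pairs from logstats lines."""
    samples = []
    for line in lines:
        match = LOGSTATS_RE.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            samples.append((when, int(match.group(2))))
    return samples


def interval_rates(samples):
    """Pages per minute for each pair of consecutive samples."""
    rates = []
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        seconds = (t1 - t0).total_seconds()
        if seconds > 0:
            rates.append((p1 - p0) * 60.0 / seconds)
    return rates


if __name__ == "__main__":
    print(interval_rates(parse_logstats(SAMPLE_LOG)))
```

Feeding the full log of each run through these helpers gives two rate series that can be plotted or averaged over the first and second halves of the run to see how much each interpreter degrades.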