`response.follow_all()` problem with `cb_kwargs` getting shared by all request objects.
Description
I have a weird issue. I am scraping the following Arabic poetry website: https://diwany.org/. I scrape the meters (called "bahr" in Arabic poetry) from the links at the bottom of the site's main page. The XPath of these links is given in the spider code snippet below.
I used the `follow_all()` method with a `poems_links` list in the `cb_kwargs` dictionary to collect all poems belonging to a meter, since they are spread across many pages. The issue is that this list gets shared between the requests for all meters. It keeps accumulating poem links from the previous meters' requests, whereas I expected each request to start with its own empty list. I am sure this is not the intended behavior, is it?
On the other hand, this does not happen when I use the `follow()` method in a `for` loop!
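The difference between the two calls can be reproduced without Scrapy. The following is a minimal sketch in plain Python, assuming (this is not confirmed in the report) that `follow_all()` hands the one `cb_kwargs` dictionary it received to every request and that each request keeps only a shallow copy of it. `build_all` and the link names are hypothetical stand-ins, not Scrapy's API.

```python
def build_all(links, cb_kwargs):
    # Hypothetical stand-in for follow_all(): the caller's cb_kwargs
    # expression was evaluated once, before this function was entered.
    # dict(cb_kwargs) is a shallow copy: a new dict, but the same inner list.
    return [dict(cb_kwargs) for _ in links]


links = ["bahr-1", "bahr-2", "bahr-3"]

# follow_all() style: list() runs once, so all copies point at one list.
shared = build_all(links, dict(poems_links=list()))

# for-loop style: list() runs on every iteration, so each dict has its own list.
separate = [dict(poems_links=list()) for _ in links]

shared[0]["poems_links"].append("poem-1")
separate[0]["poems_links"].append("poem-1")

print(shared[1]["poems_links"] is shared[0]["poems_links"])      # same list object
print(separate[1]["poems_links"] is separate[0]["poems_links"])  # different list objects
```

Under that assumption, the sharing comes from Python evaluating the argument expression once per call, not from anything specific to selectors.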
Steps to Reproduce
Here is my spider code for the website. I also put comments on the important parts.
    import scrapy


    class PoemsSpider(scrapy.Spider):
        name = 'poems'
        allowed_domains = ['diwany.org']
        start_urls = ['http://diwany.org/']

        def parse(self, response):
            bahrs_links = response.xpath("//div[@class='menu-behers-ar-container']//a")
            # This is the flawed method: poems links keep accumulating from the
            # previous requests. I got that from the print statement in the callback.
            # yield from response.follow_all(
            #     bahrs_links, callback=self.parse_bahr, cb_kwargs=dict(poems_links=list())
            # )

            # This code behaves as expected.
            for link in bahrs_links:
                yield response.follow(link, callback=self.parse_bahr, cb_kwargs=dict(poems_links=list()))

        # bahr means meter
        def parse_bahr(self, response, poems_links):
            # A debugging statement. With follow_all() it shows poems links left over
            # from previous requests even when this is a new request; with follow()
            # it reports an empty list.
            print('the initial number of poems links for the url:', response.url, 'is', len(poems_links))
            poems_links += response.xpath("//div[@class='post hentry ivycat-post']/h2//a")
            # Collect poems links for the current meter, as the website uses pagination.
            next_page = response.css(".pip-nav-next a")
            if next_page:
                yield response.follow(
                    next_page[0],
                    callback=self.parse_bahr,
                    cb_kwargs=dict(poems_links=poems_links),
                )
            else:
                yield from response.follow_all(
                    poems_links, callback=self.parse_poem,
                )

        def parse_poem(self, response):
            '''rest of the code'''
            '''yield item'''
            return None
Expected behavior:
The poems list in `cb_kwargs` should be independent for each selector object!
Actual behavior:
The poems list appends links from the previous calls.
Reproduces how often:
This behavior is consistent. I know that with the duplicate filter enabled these links would be filtered out, but this is still not the intended behavior, especially when I am building some logic on this assumption!
Versions
    Scrapy       : 2.3.0
    lxml         : 4.5.2.0
    libxml2      : 2.9.10
    cssselect    : 1.1.0
    parsel       : 1.6.0
    w3lib        : 1.22.0
    Twisted      : 20.3.0
    Python       : 3.8.2 (default, Jul 16 2020, 14:00:26) - [GCC 9.3.0]
    pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020)
    cryptography : 3.0
    Platform     : Linux-5.4.0-47-generic-x86_64-with-glibc2.29
Additional context
I went through the code but did not find anything suspicious. I am not sure whether this problem is caused by the generator's behavior.
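The generator is probably not the cause: a mutable value placed in `cb_kwargs` is created once, when `follow_all()` is called, however lazily the requests are produced. One possible workaround, shown here only as a sketch, is a helper that builds the keyword arguments anew for every link. `follow_all_fresh` and `make_cb_kwargs` are hypothetical names, the URLs are made up, and `FakeResponse` is a stand-in so that the example runs without Scrapy installed; the only real call it relies on is `response.follow(url, callback=..., cb_kwargs=...)`.

```python
def follow_all_fresh(response, links, callback, make_cb_kwargs):
    """Hypothetical helper: like follow_all(), but with new cb_kwargs per link."""
    for link in links:
        # make_cb_kwargs() is called once per link, so every request gets
        # its own dict and its own inner list.
        yield response.follow(link, callback=callback, cb_kwargs=make_cb_kwargs())


class FakeResponse:
    """Stand-in for a Scrapy response, used only to make the sketch runnable."""

    def follow(self, url, callback=None, cb_kwargs=None):
        return {"url": url, "callback": callback, "cb_kwargs": cb_kwargs}


requests = list(
    follow_all_fresh(
        FakeResponse(),
        ["/bahr/1", "/bahr/2"],  # made-up links
        callback=None,
        make_cb_kwargs=lambda: dict(poems_links=list()),
    )
)

# Mutating the first request's list leaves the second one untouched.
requests[0]["cb_kwargs"]["poems_links"].append("poem-1")
print(requests[1]["cb_kwargs"]["poems_links"])
```

In the spider above this would be used as `yield from follow_all_fresh(response, bahrs_links, self.parse_bahr, lambda: dict(poems_links=list()))`, which is equivalent to the `for` loop that already behaves as expected.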
Top GitHub Comments
@MiroMz
Thanks for pointing me to the website's terms of use. It seems that this page [terms of usage] was added recently. I want to let you know that we did not scrape the website in order to reproduce it. We tried to do so for academic research purposes.
I do not think this is the best place to discuss this stuff. Could you please drop your email or the team email so that we can communicate?
Dear Maged Saeed,
I want to draw your attention to the fact that your attempt to download parts of our website violates our terms of use, which can be unlawful and punishable.
from our term of use agreement: “(d) except as expressly stated herein, no part of the Site may be copied, reproduced, distributed, republished, downloaded, displayed, posted or transmitted in any form or by any means unless otherwise indicated, any future release, update, or other addition to functionality of the Site shall be subject to these Terms. All copyright and other proprietary notices on the Site must be retained on all copies thereof.”
We ask you kindly to first desist from violating our ToU, we consider your request on Github as evidence for such an attempt and we keep all our legal rights to proceed with official procedures regarding this issue in the EU and beyond.
Best regards, Diwany team