
`response.follow_all()` problem with `cb_kwargs` getting shared by all request objects.


Description

I have a weird issue. I am scraping the Arabic poetry website https://diwany.org/. I scraped the meters (called 'bahr' in Arabic poetry) from their links at the bottom of the site's main page. The XPath for these links is shown in the spider code snippet below.

I used the follow_all() method with a poems list in the cb_kwargs dictionary to collect all the poems belonging to a meter, since they are spread across many pages. The issue is that this list gets shared between all the meters' requests: it keeps accumulating poem links from previous meters' requests, when each request should receive a fresh, empty list. I am sure this is not the intended behavior, is it?!

On the other hand, this behavior does not happen when I use the follow() method in a for loop!
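
To illustrate the suspicion in plain Python (a hypothetical sketch, not Scrapy code): if follow_all() were to hand the same cb_kwargs dictionary to every request it creates, then every callback would end up mutating one shared list:

# A minimal sketch of the suspected aliasing; follow_all_like and these
# (url, cb_kwargs) tuples are hypothetical stand-ins, not real Scrapy objects.
def follow_all_like(urls, cb_kwargs):
    # The same dict object is attached to every generated "request".
    return [(url, cb_kwargs) for url in urls]

shared = dict(poems_links=list())
requests = follow_all_like(["bahr-1", "bahr-2"], shared)
requests[0][1]["poems_links"].append("poem-a")  # the first meter's callback appends
print(requests[1][1]["poems_links"])            # ['poem-a'] -- the second meter sees it too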

Steps to Reproduce

Here is my spider code for the website, with comments on the important parts.

import scrapy

class PoemsSpider(scrapy.Spider):
    name = 'poems'
    allowed_domains = ['diwany.org']
    start_urls = ['http://diwany.org/']

    def parse(self, response):
        bahrs_links = response.xpath("//div[@class='menu-behers-ar-container']//a")
        # This is the flawed variant: poems_links keeps accumulating links from
        # previous requests (observed via the print statement in the callback).
        # yield from response.follow_all(
        #    bahrs_links, callback=self.parse_bahr, cb_kwargs=dict(poems_links=list())
        # )
        # This code behaves as expected
        for link in bahrs_links:
            yield response.follow(link, callback=self.parse_bahr, cb_kwargs=dict(poems_links=list()))

    # bahr means meter
    def parse_bahr(self, response, poems_links):
        # Debugging statement: with follow_all() this prints links left over from
        # previous requests even though this is a new request; with follow() it
        # prints an empty list, as expected.
        print('the initial number of poems links for the url:', response.url, 'is', len(poems_links))
        poems_links += response.xpath("//div[@class='post hentry ivycat-post']/h2//a")
        # Collect the poem links for the current meter; the website paginates them
        next_page = response.css(".pip-nav-next a")
        if next_page:
            yield response.follow(
                next_page[0],
                callback=self.parse_bahr,
                cb_kwargs=dict(poems_links=poems_links),
            )
        else:
            yield from response.follow_all(
                poems_links, callback=self.parse_poem,
            )

    def parse_poem(self, response):
        '''rest of the code'''
        '''yield item'''
        return None
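
A possible workaround sketch I considered for parse(), keeping follow_all()'s brevity while still giving each request its own list (essentially equivalent to the working for loop above), would be to build a fresh cb_kwargs mapping per link:

        # Hypothetical workaround sketch: one fresh dict (and fresh list) per
        # request, so nothing is shared between callbacks.
        yield from (
            response.follow(link, callback=self.parse_bahr, cb_kwargs={"poems_links": []})
            for link in bahrs_links
        )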

Expected behavior:

The poems list in cb_kwargs should be independent for each request generated from a selector object!

Actual behavior:

The poems list keeps accumulating links from previous callback invocations.

Reproduces how often:

This behavior is consistent. I know that the duplicate filter would filter these links out, but this is not the intended behavior, especially since I am building logic on top of this assumption!

Versions

Output of executing scrapy version --verbose in the command line:

Scrapy       : 2.3.0
lxml         : 4.5.2.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.8.2 (default, Jul 16 2020, 14:00:26) - [GCC 9.3.0]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020)
cryptography : 3.0
Platform     : Linux-5.4.0-47-generic-x86_64-with-glibc2.29

Additional context

I went through the code but did not find anything suspicious. I am not sure whether this problem is caused by generator behavior?
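
From my reading of the Scrapy 2.3 source, TextResponse.follow_all() yields one self.follow(...) call per URL while forwarding the same cb_kwargs object each time, roughly like this (a paraphrase, not the exact source):

# Paraphrase of what follow_all() appears to do: the single cb_kwargs
# reference is forwarded to every follow() call, so a mutable value inside
# it (like the poems_links list) is shared by all generated Requests.
def follow_all(self, urls, callback=None, cb_kwargs=None, **kwargs):
    return (
        self.follow(url, callback=callback, cb_kwargs=cb_kwargs, **kwargs)
        for url in urls
    )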

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

MagedSaeed commented, Oct 26, 2020 (1 reaction)

@MiroMz

Thanks for notifying me about the website's terms of use. It seems that this page [terms of usage] was added recently. I want to let you know that we did not scrape the website in order to reproduce it; we attempted to do so for academic research purposes.

I do not think this is the best place to discuss this matter. Could you please share your email, or the team's email, so that we can communicate?

MiroMz commented, Oct 22, 2020 (1 reaction)

Dear Maged Saeed,

I want to draw your attention to the fact that your attempt to download parts of our website violates our terms of use, which can be unlawful and punishable.

From our terms of use agreement: "(d) except as expressly stated herein, no part of the Site may be copied, reproduced, distributed, republished, downloaded, displayed, posted or transmitted in any form or by any means. Unless otherwise indicated, any future release, update, or other addition to functionality of the Site shall be subject to these Terms. All copyright and other proprietary notices on the Site must be retained on all copies thereof."

We kindly ask you to first desist from violating our ToU. We consider your request on GitHub as evidence of such an attempt, and we reserve all our legal rights to proceed with official procedures regarding this issue in the EU and beyond.

Best regards, Diwany team
