
response.follow_all or SelectorList.follow_all shortcut


What do you think about adding a response.follow_all shortcut which returns a list of requests? This is inspired by this note in the docs:

response.follow(response.css('li.next a')) is not valid because response.css returns a list-like object with selectors for all results, not a single selector. A for loop like in the example above, or response.follow(response.css('li.next a')[0]), is fine.

So instead of

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

users would be able to write (in Python 3)

yield from response.follow_all(response.css('li.next a::attr(href)'), self.parse)

We can also add 'css' and 'xpath' support to it, as keyword arguments; it would shorten the code to this:

yield from response.follow_all(css='li.next a::attr(href)', callback=self.parse)

(this is a follow-up to https://github.com/scrapy/scrapy/issues/1940 and https://github.com/scrapy/scrapy/issues/2540)
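For context, here is a minimal sketch of how the proposed shortcut would read inside a spider. The spider name, start URL, and selectors are placeholders, and follow_all is used with the proposed signature:

import scrapy


class QuotesSpider(scrapy.Spider):
    # Placeholder spider illustrating the proposed shortcut.
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        # One call replaces the for loop over pagination links; follow_all
        # extracts the href from each matched <a> element.
        yield from response.follow_all(css="li.next a", callback=self.parse)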

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
immerrr commented, Feb 27, 2017

@kmike, only a rough one.

I think of RequestSet as something that (a rough sketch follows the list):

  • has a DeferredList-like API (asyncio.gather has the drawback of being a function rather than an object)
  • knows that it contains Requests as deferreds
  • has its own callback to be run when the request set is dealt with, and probably an errback too, for symmetry
  • has a Deferred that would fire when the request set is being cleaned up
  • has its own 'meta' dictionary, much like the one shared between Request & Response objects
  • since it knows that it contains Requests, it can piggyback on the first received value and do response.meta['request_set'] = self so that the callbacks can access the shared data
  • (maybe) it should silently copy fields from request_set.meta to response.meta if they are unset in request.meta, or maybe even make request.meta a ChainMap-like dict with fallback to request_set.meta
  • it should wrap requests coming from its respective response callbacks unless specifically asked not to do that, e.g. with request.meta['request_set'] = None
  • (maybe) it should be possible to return other RequestSet from response callbacks
  • (maybe) returned RequestSets should be made nestable, i.e. to keep the parent RequestSet alive during their lifetime if not explicitly asked not to with request_set.meta['request_set'] = None (if nesting is considered, the request_set metadata key seems redundant and we might consider parent_set instead)
  • not sure if it’s worth it to make them nestable, i.e. if a certain response callback produces a different RequestSet, should it be owned by the parent request set?
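To make the first few bullets concrete, here is a rough sketch built on Twisted's DeferredList. The class shape, field names, and wiring are illustrative guesses, not Scrapy API:

from twisted.internet.defer import Deferred, DeferredList


class RequestSet:
    """Groups member deferreds (stand-ins for in-flight Requests), exposes a
    shared meta dict, and runs one callback when every member has finished."""

    def __init__(self, deferreds, callback=None):
        self.meta = {}  # shared data, analogous to Request.meta
        # consumeErrors=True: per-request failures show up in the result list
        # as (False, Failure) pairs instead of failing the whole set.
        self._done = DeferredList(deferreds, consumeErrors=True)
        if callback is not None:
            self._done.addCallback(callback)


# Usage: the set-level callback fires once, after both members complete.
d1, d2 = Deferred(), Deferred()
rs = RequestSet([d1, d2], callback=lambda results: print("set done:", results))
rs.meta["shared"] = "visible to every callback in the set"
d1.callback("response 1")
d2.callback("response 2")

Firing both deferreds runs the set-level callback once with the collected (success, value) pairs, which is the DeferredList behaviour the first bullet refers to.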

One more thing to consider is cross-referencing RequestSets, i.e. when two requests that should belong to one RequestSet are produced by different callbacks and thus have different scopes. Maybe a simple WeakValueDictionary would suffice to look up the sets and ensure the references are cleaned up as necessary. But then you’d have the usual get-or-create operation, which might be worth a reference implementation.
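A minimal sketch of that get-or-create lookup, reusing the RequestSet sketch above with a hypothetical key scheme:

import weakref

_request_sets = weakref.WeakValueDictionary()


def get_or_create_request_set(key, factory):
    """Return the set registered under key, creating and caching it if absent.
    The weak-valued mapping drops an entry on its own once no live request
    references the set, which covers the cleanup concern."""
    rs = _request_sets.get(key)
    if rs is None:
        rs = factory()
        _request_sets[key] = rs
    return rs


# Usage (hypothetical key): callbacks in different scopes resolve to the same set.
rs = get_or_create_request_set(("spider", "listing-42"), lambda: RequestSet([]))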

0 reactions
immerrr commented, Feb 27, 2017

@kmike done: #2600
