Remove UrlLengthMiddleware from default enabled middlewares
According to RFC 2616, section 3.2.1:
The HTTP protocol does not place any a priori limit on the length of a URI.
Servers MUST be able to handle the URI of any resource they serve, and
SHOULD be able to handle URIs of unbounded length if they provide
GET-based forms that could generate such URIs. A server SHOULD
return 414 (Request-URI Too Long) status if a URI is longer than the server
can handle (see section 10.4.15).
We enable scrapy.spidermiddlewares.urllength.UrlLengthMiddleware by default. Its limit is defined by the URLLENGTH_LIMIT setting (which can be modified in the project settings) and defaults to 2083. As mentioned here, the reason for this number is that Microsoft Internet Explorer cannot handle URIs longer than that.
This can cause problems for spiders, which will skip requests whose URIs are longer than that. Certainly we can change URLLENGTH_LIMIT on these spiders, but sometimes it is not easy to pick the right value, so we end up setting a higher number just to make the middleware happy. This is what I am doing in a real-world project, but the solution doesn’t look good.
I know that we can either disable the middleware or change the length limit, but I think it is smoother for the user not to have to worry about this artificial limit we have in Scrapy. We are not using Microsoft Internet Explorer, so we don’t need this limit.
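For reference, the two workarounds mentioned above could look roughly like this in a project's settings module. This is a minimal sketch: the raised limit is an arbitrary placeholder chosen for illustration, not a recommended value, and in practice you would pick only one of the two options.

```python
# settings.py (sketch; pick one of the two workarounds)

# Workaround 1: raise the limit well above the 2083 default.
# 100_000 is an arbitrary placeholder, not a recommendation.
URLLENGTH_LIMIT = 100_000

# Workaround 2: disable the middleware entirely by mapping it to None.
SPIDER_MIDDLEWARES = {
    "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": None,
}
```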
Some alternatives that I considered:
- Remove UrlLengthMiddleware from the default enabled middlewares, so we don’t need to worry about that limit unless we really have to (I don’t know the exact use case that required this limit, so keeping the middleware available may make sense);
- Change the default value to a more reasonable one (although a reasonable value is difficult to find);
- Allow URLLENGTH_LIMIT = -1 and, in this case, ignore the limit. This seems an easier change in the settings than modifying the SPIDER_MIDDLEWARES setting.
Issue Analytics
- Created 2 years ago
- Reactions:2
- Comments:12 (11 by maintainers)
Top GitHub Comments
@rennerocha We need to add documentation and tests for it, but note that, as it turns out, the existing code already disables the middleware if you set the setting to 0.
Yeah, why not? I think we’re using 0 as that kind of value for other settings.
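To illustrate the behaviour described in the comments, here is a simplified stand-in for a URL length filter where a limit of 0 turns the check off. It is not Scrapy's actual implementation: `filter_by_url_length` and the example URLs are hypothetical names introduced only for this sketch.

```python
from typing import Iterable, List


def filter_by_url_length(urls: Iterable[str], limit: int) -> List[str]:
    """Hypothetical helper: keep URLs no longer than `limit`; 0 disables the check."""
    if not limit:
        # A limit of 0 is treated as "no limit": pass everything through.
        return list(urls)
    return [url for url in urls if len(url) <= limit]


short_url = "https://example.com/"
long_url = "https://example.com/?q=" + "a" * 3000  # longer than 2083 characters

# With the 2083 default, the long URL is silently dropped.
kept_default = filter_by_url_length([short_url, long_url], 2083)

# With the limit set to 0, nothing is filtered.
kept_unlimited = filter_by_url_length([short_url, long_url], 0)
```

In a project, the equivalent would presumably be setting URLLENGTH_LIMIT = 0 in the settings, as the comment above suggests.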