Remove UrlLengthMiddleware from default enabled middlewares

According to RFC 2616, section 3.2.1:

   The HTTP protocol does not place any a priori limit on the length of a URI.
   Servers MUST be able to handle the URI of any resource they serve, and
   SHOULD be able to handle URIs of unbounded length if they provide 
   GET-based forms that could generate such URIs. A server SHOULD 
   return 414 (Request-URI Too Long) status if a URI is longer than the server
   can handle (see section 10.4.15).

We enable scrapy.spidermiddlewares.urllength.UrlLengthMiddleware by default. Its limit is defined by the URLLENGTH_LIMIT setting (which can be modified in the project settings) and defaults to 2083. As mentioned here, this number comes from Microsoft Internet Explorer, which cannot handle URIs longer than that.

This can cause problems for spiders, which will skip requests whose URIs are longer than that. We can certainly change URLLENGTH_LIMIT in these spiders, but sometimes it is not easy to determine the right value, and we end up setting a higher number just to make the middleware happy. This is what I am doing in a real-world project, but the solution doesn’t look good.

I know that we can either disable the middleware or change the length limit, but I think it is smoother for the user not to have to worry about this artificial limit we have in Scrapy. We are not using Microsoft Internet Explorer, so we don’t need this limit.

Some alternatives that I considered:

  • Remove UrlLengthMiddleware from the middlewares enabled by default, so we don’t need to worry about the limit unless we really have to (I don’t know the exact use case that required this limit, so keeping the middleware available may make sense);
  • Change the default to a more reasonable value (although a reasonable value is difficult to find);
  • Allow URLLENGTH_LIMIT = -1 and, in that case, ignore the limit. This seems an easier change in the settings than modifying the SPIDER_MIDDLEWARES setting.
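
For reference, the two existing workarounds mentioned above would look roughly like this in a project’s settings.py. This is only a sketch: the value 100_000 is an arbitrary placeholder chosen for illustration, not a recommendation, and in practice you would pick one of the two options.

```python
# settings.py (sketch): two ways to work around the default limit today.

# Option 1: raise the limit. The number is an arbitrary illustrative value
# (an assumption), picked only to be larger than the 2083 default.
URLLENGTH_LIMIT = 100_000

# Option 2: disable the built-in spider middleware by mapping it to None.
SPIDER_MIDDLEWARES = {
    "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": None,
}
```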

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 12 (11 by maintainers)

Top GitHub Comments

2 reactions
Gallaecio commented, May 11, 2021

@rennerocha We need to add documentation and tests for it, but note that, as it turns out, the existing code already disables the middleware if you set the setting to 0.
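
To illustrate the behaviour described here, the following is a simplified, standalone sketch and not Scrapy’s actual code. The function name filter_by_url_length and the constant DEFAULT_URLLENGTH_LIMIT are hypothetical; the only facts taken from this thread are the 2083 default and that a value of 0 turns the check off.

```python
from typing import Iterable, Iterator

# Default taken from the issue text; the constant name itself is hypothetical.
DEFAULT_URLLENGTH_LIMIT = 2083


def filter_by_url_length(
    urls: Iterable[str], limit: int = DEFAULT_URLLENGTH_LIMIT
) -> Iterator[str]:
    """Yield URLs no longer than `limit`; a limit of 0 disables the check."""
    for url in urls:
        # `not limit` is True for 0, so every URL passes when the limit is 0.
        if not limit or len(url) <= limit:
            yield url


short_url = "https://example.com/?q=" + "a" * 10
long_url = "https://example.com/?q=" + "a" * 3000

# With the default limit, the long URL is silently dropped.
kept_default = list(filter_by_url_length([short_url, long_url]))

# With the limit set to 0, nothing is filtered out.
kept_unlimited = list(filter_by_url_length([short_url, long_url], limit=0))
```

In a real project the equivalent would simply be setting URLLENGTH_LIMIT = 0 in the project settings.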

1 reaction
kmike commented, May 10, 2021

   Shall we simply allow to set URLLENGTH_LIMIT to a value that effectively disables the middleware? Any preference? (-1, 0, None).

Yeah, why not? I think we already use 0 for that purpose in other settings.
