
Identical requests sent by Scrapy vs Requests module returning different status codes


Description

Recently a spider I made to crawl Craigslist for rental listings broke. When I checked the logs, it turned out that all of my requests were being rejected with HTTP 403 status codes. Naturally, I assumed the issue was that I wasn’t setting proper headers or using proxies, so I went ahead and added automatic user-agent and header rotation as well as proxy servers. None of this helped. In a last-ditch effort, I wrote a simple GET request using the requests module. Somehow this default request worked on the same URLs, returning 200 status codes, even though it came from the same IP address without any proxy servers or user agents configured.

I don’t understand exactly how the request is sent out by Scrapy vs. the requests module, but even when I configured both to share the exact same request headers, one returns a 403 error while the other returns 200. Based on this StackOverflow post, it also seems I’m not the only one to hit this weird result.

Steps to Reproduce

  1. Set up a default Scrapy spider with only the default settings active. (A minimal spider combining these steps is sketched after the log output below.)

  2. Install the latest version of requests and make a default GET request to any site using requests.get("any site"). Capture the headers used by this default request. For me they were:

GET / HTTP/1.1
Host: localhost:8080
User-Agent: python-requests/2.25.1
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
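
As an aside, you don’t need a live server to see those defaults; requests can prepare a request without sending it. A small sketch, assuming requests 2.x (note that Host is only added at send time):

import requests

with requests.Session() as session:
    # Prepare the request without sending it, so nothing needs to be listening.
    prepared = session.prepare_request(requests.Request("GET", "http://localhost:8080"))
    print(prepared.headers)  # User-Agent, Accept-Encoding, Accept, Connection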
  3. Configure the headers of the Scrapy spider’s request call to have the exact same headers from step 2:
scrapy.Request(
    url="any website",
    callback=self.parse,
    headers={
        "User-Agent": "python-requests/2.25.1",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "*/*",
        "Connection": "keep-alive",
    },
)
  4. Start a Netcat server locally to verify that Scrapy and requests send the same request. I started mine on port 8080 with the command nc -l 8080. Now change the request URLs for both Scrapy and requests to http://localhost:8080. Run both and examine the results.

For me, I see the following from Netcat for the request sent with requests module:

GET / HTTP/1.1
Host: localhost:8080
User-Agent: python-requests/2.25.1
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

And I see the following from the Scrapy Spider’s request:

GET / HTTP/1.1
User-Agent: python-requests/2.25.1
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Accept-Language: en
Host: localhost:8080

So, apart from the header order and the extra Accept-Language: en that Scrapy adds (presumably from its DEFAULT_REQUEST_HEADERS setting), they should be sending the same request with the same info.

  5. Change the URL from http://localhost:8080 to https://lasvegas.craigslist.org/d/apts-housing-for-rent/search/apa. You should now see that, for some reason, the requests module returns status code 200 while the spider’s request returns a 403 Forbidden error. If you check the response body for the 403, you should see something along the lines of:
# these are custom formatted log outputs from me

2021-01-08 20:26:02 [root] INFO: 
            Http error code 403 with response:
            ----------------------------------
            response headers: {b'Set-Cookie': [b'cl_b=4|5643cfbca785a2e77246555fdf34d45a3a666145|1610166362kVl2U;path=/;domain=.craigslist.org;expires=Fri, 01-Jan-2038 00:00:00 GMT'], b'Strict-Transport-Security': [b'max-age=63072000']}
            ----------------------------------
            original request headers: {b'User-Agent': [b'python-requests/2.25.1'], b'Accept-Encoding': [b'gzip, deflate'], b'Accept': [b'*/*'], b'Connection': [b'keep-alive'], b'Accept-Language': [b'en']}
            ----------------------------------
            body of response: This IP has been automatically blocked.
If you have questions, please email: blocks-b1607628794570390@craigslist.org

            ----------------------------------
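
Putting the steps above together, a minimal spider that reproduces the comparison might look like the following. This is a sketch only: the spider name is a placeholder, and handle_httpstatus_list is added so the 403 actually reaches parse() instead of being filtered out by Scrapy’s HttpErrorMiddleware:

import scrapy

class ReproSpider(scrapy.Spider):
    name = "repro"  # placeholder name
    # Let 403 responses reach parse() instead of being dropped.
    handle_httpstatus_list = [403]

    def start_requests(self):
        yield scrapy.Request(
            url="https://lasvegas.craigslist.org/d/apts-housing-for-rent/search/apa",
            callback=self.parse,
            headers={
                "User-Agent": "python-requests/2.25.1",
                "Accept-Encoding": "gzip, deflate",
                "Accept": "*/*",
                "Connection": "keep-alive",
            },
        )

    def parse(self, response):
        self.logger.info("status: %s", response.status)

Run it with scrapy runspider repro_spider.py (file name assumed) and compare against a plain requests.get() on the same URL.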

Expected behavior: When sending seemingly identical requests to the same URL from the same IP address, one via a Scrapy request and one via the requests module, I expected both to return the same result with the same HTTP status code.

Actual behavior: The Scrapy request returns 403 Forbidden while the requests module request returns 200 OK.

Reproduces how often: 100% for me and another colleague in a different city and state.

Versions

Scrapy       : 2.1.0
lxml         : 4.6.1.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.8.6 (default, Oct 10 2020, 07:54:55) - [GCC 5.4.0 20160609]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020)
cryptography : 3.2.1
Platform     : Linux-5.8.0-7630-generic-x86_64-with-glibc2.2.5

Additional context

I tried this with other sites and it works as intended: there is no difference between the two requests. For some reason, however, Craigslist is able to tell the two requests apart and identifies one as coming from Scrapy, which automatically gets blocked.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 35 (14 by maintainers)

Top GitHub Comments

8 reactions
wRAR commented, Jan 11, 2021

This is one of the websites where setting DOWNLOADER_CLIENT_TLS_METHOD=TLSv1.2 helps. Still not sure what is happening.
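
For anyone trying this, the setting goes in the project’s settings.py (or a spider’s custom_settings); a minimal sketch:

# settings.py
# Pin the TLS method used by Scrapy's downloader to TLS 1.2.
DOWNLOADER_CLIENT_TLS_METHOD = "TLSv1.2"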

3 reactions
orzel commented, Aug 16, 2021

I have a similar problem on a lot of different sites, all protected by Cloudflare. In all cases, requests.get() just works, and Scrapy fails with 403. It also works with Chrome/Firefox; there isn’t even a captcha or anything “visibly” related to Cloudflare. It fails with curl (the command line tool, not the PHP stuff).

I tried all of this together:

  • using exactly the same user agent
  • using exactly the same headers, in the same order
  • trying all values for DOWNLOADER_CLIENT_TLS_METHOD
  • overriding start_requests() to perform a requests.get(), check that it returns 200, and pass all the relevant cookies to Scrapy for the next queries (a rough sketch of this appears below). There are indeed quite a few related to Cloudflare: cftoken, cfid, __cflb

And all of that fails. The last one I really wonder about, but hey…
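
A rough sketch of that last approach (untested; the spider name and URL are placeholders):

import requests
import scrapy

class WarmupSpider(scrapy.Spider):
    name = "warmup"  # placeholder name

    def start_requests(self):
        url = "https://example.com/"  # placeholder URL
        # Warm up with requests (which gets through), then hand its
        # cookies to Scrapy for the follow-up request.
        warmup = requests.get(url)
        assert warmup.status_code == 200
        yield scrapy.Request(
            url,
            cookies=warmup.cookies.get_dict(),
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("status: %s", response.status)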

My (wild) guess is the following: there are only two remaining points on which Cloudflare can base its accept/deny policy:

  • the order of headers, especially the “Host” one. You can’t precisely control header ordering with Scrapy.
  • the case of headers. It’s common knowledge that they are case-insensitive in the specs, but Cloudflare could very well not care.

I tried, but failed, to create a downloader middleware (registered via DOWNLOADER_MIDDLEWARES) that would use requests.get() to fetch the pages. Has anyone ever done this?
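
For what it’s worth, a minimal sketch of such a middleware (untested; the class and module names are placeholders, and it ignores proxies, retries, and non-GET requests):

import requests
from scrapy.http import HtmlResponse

class RequestsDownloaderMiddleware:
    """Fetch pages with requests instead of Scrapy's own downloader."""

    def process_request(self, request, spider):
        # Returning a Response from process_request short-circuits
        # Scrapy's downloader for this request.
        resp = requests.get(
            request.url,
            headers=dict(request.headers.to_unicode_dict()),
            timeout=30,
        )
        return HtmlResponse(
            url=resp.url,
            status=resp.status_code,
            headers=dict(resp.headers),
            body=resp.content,
            request=request,
        )

It would then be enabled in settings.py with something like:

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RequestsDownloaderMiddleware": 543,  # path is a placeholder
}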
