Scrapy gives HTML response different from requests.get

Original GitHub issue: https://github.com/scrapy/scrapy/issues/2431

Scrapy version: 1.1.2
Python version: 2.7.12
Platform: Mac OS X 10.11.6

The issue:

For the URL in the following minimal working example, the HTML in the Scrapy response differs from the HTML obtained with requests.get. The latter appears to be the correct one: Scrapy seems to duplicate part of the response HTML. This does not happen for all sites.

See the attached file for the two HTML files, or run the following code to generate them.

import scrapy
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor
import requests

url = 'http://training.sac.net.cn/cms/flkcalone.htm?myId=4028d0ee57ec28180157f55059b87209'

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,zh-TW;q=0.2',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }


class Test(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        request = scrapy.Request(url=url, callback=self.parse, headers=headers)
        yield request

    def parse(self, response):
        # Binary mode: encode() returns bytes, which a text-mode file rejects on Python 3.
        with open('response_with_scrapy.html', 'wb') as f:
            f.write(response.text.encode('utf-8'))


if __name__ == '__main__':
    # Binary mode: encode() returns bytes, which a text-mode file rejects on Python 3.
    with open('response_with_requests.html', 'wb') as f:
        f.write(requests.get(url, headers=headers).text.encode('utf-8'))

    runner = CrawlerRunner()
    runner.crawl(Test)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()

two_responses.zip
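
To check the claim that part of the HTML is duplicated, the two saved files can be compared line by line. The following is only a rough sketch and not part of the original report: it assumes the file names written by the script above, and the helper names (read_lines, diff_lines, extra_repeats, compare_files) are made up for illustration.

```python
import difflib
from collections import Counter
from pathlib import Path


def read_lines(path):
    """Read a saved HTML file as a list of stripped, non-empty lines."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def diff_lines(a_lines, b_lines, a_name="a", b_name="b"):
    """Return a unified diff between two lists of lines."""
    return list(
        difflib.unified_diff(
            a_lines, b_lines, fromfile=a_name, tofile=b_name, lineterm=""
        )
    )


def extra_repeats(a_lines, b_lines):
    """Map each line that occurs more often in a_lines than in b_lines
    to the number of surplus occurrences."""
    # Counter subtraction keeps only positive counts.
    return dict(Counter(a_lines) - Counter(b_lines))


def compare_files(scrapy_path, requests_path):
    """Print the diff and the lines that appear extra times in the Scrapy file."""
    scrapy_lines = read_lines(scrapy_path)
    requests_lines = read_lines(requests_path)
    for line in diff_lines(requests_lines, scrapy_lines, "requests", "scrapy"):
        print(line)
    for line, count in extra_repeats(scrapy_lines, requests_lines).items():
        print(count, line[:80])
```

Calling compare_files('response_with_scrapy.html', 'response_with_requests.html') prints the diff first, then each line that shows up more often in the Scrapy output than in the requests output. Lines that legitimately repeat in both files do not show up in the second list, since only the surplus is counted.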

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

2 reactions
saaiprakash1 commented, Dec 11, 2019

requests 2.22.0, Scrapy 1.8.0, Python 3.5.2

Same issue for me. The HTML text in the response from scrapy.Request is different from the one obtained with requests.get.

Error: “Session Timed Out. Please Login Again!” appears in the HTML page. Using requests.get, I am getting the data in the HTML.

0 reactions
RaoTauqeerSajid commented, Dec 12, 2019

Thanks for your reply

