Scrapy gives HTML response different from requests.get

scrapy version: 1.1.2
python version: 2.7.12
platform: Mac OS X 10.11.6

The issue:

For the URL in the minimal working example below, the HTML in Scrapy's response differs from the HTML returned by requests.get. The requests.get output appears to be the correct one; Scrapy seems to duplicate part of the response HTML. This does not happen for all sites.

The attached file contains the two HTML outputs. Alternatively, run the following code to generate them.

import scrapy
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor
import requests

url = 'http://training.sac.net.cn/cms/flkcalone.htm?myId=4028d0ee57ec28180157f55059b87209'

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,zh-TW;q=0.2',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }


class Test(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        request = scrapy.Request(url=url, callback=self.parse, headers=headers)
        yield request

    def parse(self, response):
        # Binary mode so the UTF-8 bytes are written as-is on Python 2 and 3
        with open('response_with_scrapy.html', 'wb') as f:
            f.write(response.text.encode('utf-8'))


if __name__ == '__main__':
    # Binary mode so the UTF-8 bytes are written as-is on Python 2 and 3
    with open('response_with_requests.html', 'wb') as f:
        f.write(requests.get(url, headers=headers).text.encode('utf-8'))

    runner = CrawlerRunner()
    runner.crawl(Test)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()

two_responses.zip
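
To see exactly where the two outputs diverge (for example, which part of the HTML appears twice in the Scrapy version), a line-based diff can help. The following is a minimal sketch using Python's standard difflib module. The helper names diff_html and diff_files are hypothetical and not part of Scrapy or requests, and the sketch assumes both files were written as UTF-8, as in the script above.

```python
import difflib
import io


def diff_html(text_a, text_b,
              name_a="response_with_requests.html",
              name_b="response_with_scrapy.html"):
    """Return the unified-diff lines between two HTML strings.

    Hypothetical helper for illustration only. An empty list means
    the two strings are identical line by line.
    """
    return list(difflib.unified_diff(
        text_a.splitlines(),
        text_b.splitlines(),
        fromfile=name_a,
        tofile=name_b,
        lineterm="",
    ))


def diff_files(path_a, path_b):
    """Read two UTF-8 files and diff their contents (hypothetical helper)."""
    with io.open(path_a, encoding="utf-8") as fa:
        text_a = fa.read()
    with io.open(path_b, encoding="utf-8") as fb:
        text_b = fb.read()
    return diff_html(text_a, text_b, path_a, path_b)
```

After running the script above, calling diff_files('response_with_requests.html', 'response_with_scrapy.html') and printing each returned line should show the lines present only in the Scrapy output prefixed with "+".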

Issue Analytics

Created 7 years ago
Comments: 8 (3 by maintainers)

Top GitHub Comments

saaiprakash1 commented, Dec 11, 2019

requests==2.22.0, Scrapy 1.8.0, Python 3.5.2

I have the same issue: the HTML text in the response from scrapy.Request is different from the one obtained with requests.get.

The Scrapy response contains the error “Session Timed Out. Please Login Again!” in the HTML page, whereas with requests.get I get the data in the HTML.

RaoTauqeerSajid commented, Dec 12, 2019

Thanks for your reply