Scrapy gives HTML response different from requests.get
scrapy version: 1.1.2, python version: 2.7.12, platform: Mac OS X 10.11.6
The issue:
For the URL given in the following minimal working example, the HTML text in the response from scrapy differs from the one obtained with requests.get. The latter seems to be the correct one. It seems scrapy somehow duplicates part of the response HTML. This does not happen for all sites.
See the attached file for the two different HTML files, or run the following code to generate them.
import scrapy
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor
import requests

url = 'http://training.sac.net.cn/cms/flkcalone.htm?myId=4028d0ee57ec28180157f55059b87209'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,zh-TW;q=0.2',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
}


class Test(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        # Send the same headers that are used for requests.get below.
        request = scrapy.Request(url=url, callback=self.parse, headers=headers)
        yield request

    def parse(self, response):
        # Save the HTML as seen by scrapy (Python 2: encode to a byte string).
        with open('response_with_scrapy.html', 'w') as f:
            f.write(response.text.encode('utf-8'))


if __name__ == '__main__':
    # Save the HTML as seen by requests, for comparison.
    with open('response_with_requests.html', 'w') as f:
        f.write(requests.get(url, headers=headers).text.encode('utf-8'))

    # Run the spider and stop the reactor once the crawl finishes.
    runner = CrawlerRunner()
    runner.crawl(Test)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
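To narrow down which part of the scrapy response is duplicated, the two saved files can be compared line by line. The following is only a minimal sketch using the standard library's difflib; the helper names extra_lines and duplicated_lines are hypothetical and not part of scrapy or requests, and a line-based comparison is an assumption that may be too coarse for minified HTML.

```python
import difflib


def extra_lines(expected_html, actual_html):
    """Return lines present in actual_html that have no match in expected_html.

    Hypothetical helper: expected_html would be the requests.get output,
    actual_html the scrapy output.
    """
    expected = expected_html.splitlines()
    actual = actual_html.splitlines()
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    extra = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' blocks contain lines found only in actual.
        if tag in ('insert', 'replace'):
            extra.extend(actual[j1:j2])
    return extra


def duplicated_lines(expected_html, actual_html):
    """Return the extra lines whose text also occurs in expected_html.

    These are candidates for content that was repeated rather than changed.
    """
    known = set(expected_html.splitlines())
    return [line for line in extra_lines(expected_html, actual_html)
            if line in known]
```

To use it on the files generated above, read response_with_requests.html and response_with_scrapy.html as UTF-8 text and pass their contents as expected_html and actual_html respectively.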
Issue Analytics
- State:
- Created 7 years ago
- Comments:8 (3 by maintainers)
Top GitHub Comments
requests==2.22.0, Scrapy 1.8.0, Python 3.5.2
Same issue for me: the HTML text in the response from scrapy.Request is different from the one obtained with requests.get.
With scrapy.Request, the HTML page contains the error "Session Timed Out. Please Login Again!"; with requests.get, I get the data in the HTML.
Thanks for your reply