Scrapy XPath parses XML incorrectly
Software version
Scrapy 1.3.0, Python 2.7.13, lxml==3.7.2
Summary
When I test this page in the Scrapy shell, response.xpath('//description/*') does not return the right content.
The XML source is in a gist.
Test steps
scrapy shell -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.41 Safari/537.36' http://www.finishline.com/store/browse/gadgets/productLookupXML.jsp\?productId\=prod1120206
Using Scrapy's XPath:
In [0]: description = ''.join(response.xpath('//description/*').extract())
In [1]: print(description)
<p>Meet the Women's Nike Air Max Thea Mid Running Shoes. She is lighter than ever, durable as ever, and as comfortable as ever--and now, she comes in a mid-top silhouette. She is everything you could want in a running shoe and now she's better than ever with the addition of a seamless, molded leather upper. </p><p> In addition to the molded leather upper, the lelastic at the ankle provides a secure fit that is easy to put on and take off. The midsole features injected Phylon for a great cushiony feel and a visible Air-Sole unit to absorb shock, providing a forgiving, easeful feel in every foot strike. She'll go easy on you, but don't feel like you have to reciprocate. </p><p>FEATURES:</p><ul><li>UPPER: Molded leather </li> <li>MIDSOLE: Injected Phylon with Air-Sole unit <li> <li>OUTSOLE: Rubber </li> IMPORTED</li></li>
<salesText/>
<bogo>false</bogo>
<otherdetails/>
<colors><color selected="true" colorId="859550-600" thumbnail="http://images.finishline.com/is/image/FinishLine/859550_600?$Thumbnail$">Night Maroon/Sail</color>
<color selected="false" colorId="859550-400" thumbnail="http://images.finishline.com/is/image/FinishLine/859550_400?$Thumbnail$">Obsidian/Sail/Bright Grape</color>
<color selected="false" colorId="859550-200" thumbnail="http://images.finishline.com/is/image/FinishLine/859550_200?$Thumbnail$">Ale Brown/Sail/Velvet Brown</color>
<color selected="false" colorId="859550-001" thumbnail="http://images.finishline.com/is/image/FinishLine/859550_001?$Thumbnail$">Black/Sail/Reflect Silver</color></colors>
<sizes><size sku="2241540" colorId="859550-600" available="false">5.5</size>
<size sku="2241541" colorId="859550-600" available="true">6.0</size>
<size sku="2241542" colorId="859550-600" available="true">6.5</size>
<size sku="2241543" colorId="859550-600" available="true">7.0</size>
<size sku="2241544" colorId="859550-600" available="true">7.5</size>
<size sku="2241545" colorId="859550-600" available="true">8.0</size>
<size sku="2241546" colorId="859550-600" available="true">8.5</size>
<size sku="2241547" colorId="859550-600" available="true">9.0</size>
<size sku="2241548" colorId="859550-600" available="true">9.5</size>
<size sku="2241549" colorId="859550-600" available="true">10.0</size>
<size sku="2241550" colorId="859550-600" available="false">11.0</size>
<size sku="2241551" colorId="859550-400" available="false">5.5</size>
<size sku="2241552" colorId="859550-400" available="true">6.0</size>
<size sku="2241553" colorId="859550-400" available="true">6.5</size>
<size sku="2241554" colorId="859550-400" available="true">7.0</size>
<size sku="2241555" colorId="859550-400" available="true">7.5</size>
<size sku="2241556" colorId="859550-400" available="true">8.0</size>
<size sku="2241557" colorId="859550-400" available="true">8.5</size>
<size sku="2241558" colorId="859550-400" available="true">9.0</size>
<size sku="2241559" colorId="859550-400" available="true">9.5</size>
<size sku="2241560" colorId="859550-400" available="true">10.0</size>
<size sku="2241561" colorId="859550-400" available="false">11.0</size>
<size sku="2242952" colorId="859550-200" available="false">5.5</size>
<size sku="2242953" colorId="859550-200" available="true">6.0</size>
<size sku="2242954" colorId="859550-200" available="true">6.5</size>
<size sku="2242955" colorId="859550-200" available="true">7.0</size>
<size sku="2242956" colorId="859550-200" available="true">7.5</size>
<size sku="2242957" colorId="859550-200" available="true">8.0</size>
<size sku="2242958" colorId="859550-200" available="true">8.5</size>
<size sku="2242959" colorId="859550-200" available="true">9.0</size>
<size sku="2242960" colorId="859550-200" available="true">9.5</size>
<size sku="2242961" colorId="859550-200" available="true">10.0</size>
<size sku="2242962" colorId="859550-200" available="false">11.0</size>
<size sku="2242963" colorId="859550-001" available="false">5.5</size>
<size sku="2242964" colorId="859550-001" available="true">6.0</size>
<size sku="2242965" colorId="859550-001" available="true">6.5</size>
<size sku="2242966" colorId="859550-001" available="true">7.0</size>
<size sku="2242967" colorId="859550-001" available="true">7.5</size>
<size sku="2242968" colorId="859550-001" available="true">8.0</size>
<size sku="2242969" colorId="859550-001" available="true">8.5</size>
<size sku="2242970" colorId="859550-001" available="true">9.0</size>
<size sku="2242971" colorId="859550-001" available="false">9.5</size>
<size sku="2242972" colorId="859550-001" available="true">10.0</size>
<size sku="2242973" colorId="859550-001" available="false">11.0</size></sizes>
<alternateviews><alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P1?$Thumbnail$</alternateviews>
<alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P2?$Thumbnail_3quarter$</alternateviews>
<alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P3?$Thumbnail_fb$</alternateviews>
<alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P4?$Thumbnail$</alternateviews>
<alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P5?$Thumbnail_fb$</alternateviews>
<alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P6?$Thumbnail$</alternateviews>
<alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P7
?$Thumbnail$</alternateviews></alternateviews>
<startDate/>
<isShoe>true</isShoe>
<ratingImage/>
<totalReview/>
<greyOutAddToCartButtons><greyOutAddToCartButton colorId="859550-600">false</greyOutAddToCartButton>
<greyOutAddToCartButton colorId="859550-400">false</greyOutAddToCartButton>
<greyOutAddToCartButton colorId="859550-200">false</greyOutAddToCartButton>
<greyOutAddToCartButton colorId="859550-001">false</greyOutAddToCartButton></greyOutAddToCartButtons>
</ul>
Using lxml's XPath:
In [4]: from lxml.html import tostring, fromstring
In [5]: test = fromstring(response.body)
In [10]: description2 = '\n'.join([tostring(i) for i in test.xpath('//description/*')])
In [11]: print(description2)
<p>Meet the Women's Nike Air Max Thea Mid Running Shoes. She is lighter than ever, durable as ever, and as comfortable as ever--and now, she comes in a mid-top silhouette. She is everything you could want in a running shoe and now she's better than ever with the addition of a seamless, molded leather upper. </p>
<p> In addition to the molded leather upper, the lelastic at the ankle provides a secure fit that is easy to put on and take off. The midsole features injected Phylon for a great cushiony feel and a visible Air-Sole unit to absorb shock, providing a forgiving, easeful feel in every foot strike. She'll go easy on you, but don't feel like you have to reciprocate. </p>
<p>FEATURES:</p>
<ul><li>UPPER: Molded leather </li> <li>MIDSOLE: Injected Phylon with Air-Sole unit </li><li> </li><li>OUTSOLE: Rubber </li> IMPORTED</ul>
In [12]: test.xpath('//description/*')
Out[12]:
[<Element p at 0x111aafdb8>,
<Element p at 0x111bcb100>,
<Element p at 0x111bcb368>,
<Element ul at 0x111bcb470>]
Update
Using the same text from the gist, TextResponse parses it correctly, but XmlResponse parses it wrong.
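from scrapy.http import TextResponse, XmlResponse

url = 'http://test.com'
body = b''  # text from gist
# Selectors built from a TextResponse default to HTML parsing rules.
response = TextResponse(url=url, body=body, encoding='utf-8')
# Selectors built from an XmlResponse use XML parsing rules.
responsex = XmlResponse(url=url, body=body, encoding='utf-8')
response.xpath('//description/*').extract()   # parses ok
responsex.xpath('//description/*').extract()  # parses wrong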
This seems to be a bug in scrapy/parsel.
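For anyone trying to narrow this down without the gist, here is a minimal standard-library check. It assumes (this is a guess based on the lxml.html output above, not something confirmed from the feed) that the description markup leaves some `<li>` tags unclosed. The two snippets and the helper `is_well_formed_xml` are hypothetical and only illustrate the idea: HTML allows an omitted `</li>`, while a strict XML parser rejects it, so a recovering XML parser has to guess where the element ends.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets, not the real feed. The second one leaves a <li>
# unclosed, which HTML tolerates but XML does not.
WELL_FORMED = "<description><ul><li>UPPER</li><li>MIDSOLE</li></ul></description>"
UNCLOSED_LI = "<description><ul><li>UPPER</li><li>MIDSOLE<li>OUTSOLE</li></ul></description>"


def is_well_formed_xml(markup):
    """Return True if a strict (non-recovering) XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(is_well_formed_xml(WELL_FORMED))  # True
print(is_well_formed_xml(UNCLOSED_LI))  # False
```

If the real description fails this kind of check, that might explain why the XML selector nests the following sibling elements under `<ul>`, while the HTML parser closes the list where expected.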
Issue Analytics
- Created 7 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
@light4, you can even use this simpler pattern:
Thanks. That’s better.