scrapy xpath parses xml wrong

Software version

Scrapy 1.3.0, Python 2.7.13, lxml==3.7.2

Summary

When I test this page in the Scrapy shell, response.xpath('//description/*') does not return the right content: the match runs past the end of the description and includes the elements that follow it.

The XML source is available as a gist.

Test steps

scrapy shell -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.41 Safari/537.36' http://www.finishline.com/store/browse/gadgets/productLookupXML.jsp\?productId\=prod1120206

Using Scrapy's XPath:

In [0]: description = ''.join(response.xpath('//description/*').extract())

In [1]: print(description)
<p>Meet the Women's Nike Air Max Thea Mid Running Shoes. She is lighter than ever, durable as ever, and as comfortable as ever--and now, she comes in a mid-top silhouette. She is everything you could want in a running shoe and now she's better than ever with the addition of a seamless, molded leather upper. </p><p> In addition to the molded leather upper, the lelastic at the ankle provides a secure fit that is easy to put on and take off. The midsole features injected Phylon for a great cushiony feel and a visible Air-Sole unit to absorb shock, providing a forgiving, easeful feel in every foot strike. She'll go easy on you, but don't feel like you have to reciprocate. </p><p>FEATURES:</p><ul><li>UPPER: Molded leather </li> <li>MIDSOLE: Injected Phylon with Air-Sole unit <li>  <li>OUTSOLE: Rubber </li> IMPORTED</li></li>
        <salesText/>
        <bogo>false</bogo>
        <otherdetails/>
        <colors><color selected="true" colorId="859550-600" thumbnail="http://images.finishline.com/is/image/FinishLine/859550_600?$Thumbnail$">Night Maroon/Sail</color>
                                <color selected="false" colorId="859550-400" thumbnail="http://images.finishline.com/is/image/FinishLine/859550_400?$Thumbnail$">Obsidian/Sail/Bright Grape</color>
                                <color selected="false" colorId="859550-200" thumbnail="http://images.finishline.com/is/image/FinishLine/859550_200?$Thumbnail$">Ale Brown/Sail/Velvet Brown</color>
                                <color selected="false" colorId="859550-001" thumbnail="http://images.finishline.com/is/image/FinishLine/859550_001?$Thumbnail$">Black/Sail/Reflect Silver</color></colors>
        <sizes><size sku="2241540" colorId="859550-600" available="false">5.5</size>
                                <size sku="2241541" colorId="859550-600" available="true">6.0</size>
                                <size sku="2241542" colorId="859550-600" available="true">6.5</size>
                                <size sku="2241543" colorId="859550-600" available="true">7.0</size>
                                <size sku="2241544" colorId="859550-600" available="true">7.5</size>
                                <size sku="2241545" colorId="859550-600" available="true">8.0</size>
                                <size sku="2241546" colorId="859550-600" available="true">8.5</size>
                                <size sku="2241547" colorId="859550-600" available="true">9.0</size>
                                <size sku="2241548" colorId="859550-600" available="true">9.5</size>
                                <size sku="2241549" colorId="859550-600" available="true">10.0</size>
                                <size sku="2241550" colorId="859550-600" available="false">11.0</size>
                                <size sku="2241551" colorId="859550-400" available="false">5.5</size>
                                <size sku="2241552" colorId="859550-400" available="true">6.0</size>
                                <size sku="2241553" colorId="859550-400" available="true">6.5</size>
                                <size sku="2241554" colorId="859550-400" available="true">7.0</size>
                                <size sku="2241555" colorId="859550-400" available="true">7.5</size>
                                <size sku="2241556" colorId="859550-400" available="true">8.0</size>
                                <size sku="2241557" colorId="859550-400" available="true">8.5</size>
                                <size sku="2241558" colorId="859550-400" available="true">9.0</size>
                                <size sku="2241559" colorId="859550-400" available="true">9.5</size>
                                <size sku="2241560" colorId="859550-400" available="true">10.0</size>
                                <size sku="2241561" colorId="859550-400" available="false">11.0</size>
                                <size sku="2242952" colorId="859550-200" available="false">5.5</size>
                                <size sku="2242953" colorId="859550-200" available="true">6.0</size>
                                <size sku="2242954" colorId="859550-200" available="true">6.5</size>
                                <size sku="2242955" colorId="859550-200" available="true">7.0</size>
                                <size sku="2242956" colorId="859550-200" available="true">7.5</size>
                                <size sku="2242957" colorId="859550-200" available="true">8.0</size>
                                <size sku="2242958" colorId="859550-200" available="true">8.5</size>
                                <size sku="2242959" colorId="859550-200" available="true">9.0</size>
                                <size sku="2242960" colorId="859550-200" available="true">9.5</size>
                                <size sku="2242961" colorId="859550-200" available="true">10.0</size>
                                <size sku="2242962" colorId="859550-200" available="false">11.0</size>
                                <size sku="2242963" colorId="859550-001" available="false">5.5</size>
                                <size sku="2242964" colorId="859550-001" available="true">6.0</size>
                                <size sku="2242965" colorId="859550-001" available="true">6.5</size>
                                <size sku="2242966" colorId="859550-001" available="true">7.0</size>
                                <size sku="2242967" colorId="859550-001" available="true">7.5</size>
                                <size sku="2242968" colorId="859550-001" available="true">8.0</size>
                                <size sku="2242969" colorId="859550-001" available="true">8.5</size>
                                <size sku="2242970" colorId="859550-001" available="true">9.0</size>
                                <size sku="2242971" colorId="859550-001" available="false">9.5</size>
                                <size sku="2242972" colorId="859550-001" available="true">10.0</size>
                                <size sku="2242973" colorId="859550-001" available="false">11.0</size></sizes>
       <alternateviews><alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P1?$Thumbnail$</alternateviews>


						<alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P2?$Thumbnail_3quarter$</alternateviews>


						<alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P3?$Thumbnail_fb$</alternateviews>


						<alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P4?$Thumbnail$</alternateviews>


						<alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P5?$Thumbnail_fb$</alternateviews>


						<alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P6?$Thumbnail$</alternateviews>


						<alternateviews colorId="859550-600">http://images.finishline.com/is/image/FinishLine/859550_600_P7
?$Thumbnail$</alternateviews></alternateviews>
        <startDate/>
        <isShoe>true</isShoe>
        <ratingImage/>
        <totalReview/>
        <greyOutAddToCartButtons><greyOutAddToCartButton colorId="859550-600">false</greyOutAddToCartButton>
                       <greyOutAddToCartButton colorId="859550-400">false</greyOutAddToCartButton>
                       <greyOutAddToCartButton colorId="859550-200">false</greyOutAddToCartButton>
                       <greyOutAddToCartButton colorId="859550-001">false</greyOutAddToCartButton></greyOutAddToCartButtons>
    </ul>

Using lxml's XPath:

In [4]: from lxml.html import tostring, fromstring

In [5]: test = fromstring(response.body)

In [10]: description2 = '\n'.join([tostring(i) for i in test.xpath('//description/*')])

In [11]: print(description2)
<p>Meet the Women's Nike Air Max Thea Mid Running Shoes. She is lighter than ever, durable as ever, and as comfortable as ever--and now, she comes in a mid-top silhouette. She is everything you could want in a running shoe and now she's better than ever with the addition of a seamless, molded leather upper. </p>
<p> In addition to the molded leather upper, the lelastic at the ankle provides a secure fit that is easy to put on and take off. The midsole features injected Phylon for a great cushiony feel and a visible Air-Sole unit to absorb shock, providing a forgiving, easeful feel in every foot strike. She'll go easy on you, but don't feel like you have to reciprocate. </p>
<p>FEATURES:</p>
<ul><li>UPPER: Molded leather </li> <li>MIDSOLE: Injected Phylon with Air-Sole unit </li><li>  </li><li>OUTSOLE: Rubber </li> IMPORTED</ul>

In [12]: test.xpath('//description/*')
Out[12]:
[<Element p at 0x111aafdb8>,
 <Element p at 0x111bcb100>,
 <Element p at 0x111bcb368>,
 <Element ul at 0x111bcb470>]

Update

Using the same text from the gist, TextResponse parses it correctly, while XmlResponse parses it incorrectly.

from scrapy.http import TextResponse, XmlResponse

url = 'http://test.com'
body = b''  # text from gist
response = TextResponse(url=url, body=body, encoding='utf-8')
responsex = XmlResponse(url=url, body=body, encoding='utf-8')
# TextResponse: returns only the children of <description>
response.xpath('//description/*').extract()
# XmlResponse: the result runs on into the elements after <description>
responsex.xpath('//description/*').extract()

This seems to be a bug in scrapy/parsel.
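One possible explanation, offered as an assumption rather than something confirmed in this report: the markup inside <description> is HTML-style, with an <li> that is never closed, so the document is not well-formed XML. The sketch below checks this with the standard library's strict XML parser. SNIPPET is a hypothetical, reduced stand-in modelled on the output above (the real gist is not reproduced here), and is_well_formed is an illustrative helper, not part of Scrapy or parsel.

```python
import xml.etree.ElementTree as ET

# Hypothetical, reduced stand-in for the gist: the MIDSOLE <li> is never closed
SNIPPET = (
    "<product><description><ul>"
    "<li>UPPER: Molded leather </li> "
    "<li>MIDSOLE: Injected Phylon with Air-Sole unit "
    "<li>OUTSOLE: Rubber </li> IMPORTED"
    "</ul></description><bogo>false</bogo></product>"
)


def is_well_formed(xml_text):
    """Return (True, None) if a strict XML parser accepts the text,
    otherwise (False, parser error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return False, str(exc)
    return True, None


ok, error = is_well_formed(SNIPPET)
print(ok, error)
```

If the strict parser rejects the input, a recovering XML parser has to guess where the unclosed elements end, which would be consistent with the run-on result shown above. An HTML parser closes <li> implicitly, so it does not have to guess.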

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

redapple commented, Feb 7, 2017

@light4, you can even use this simpler pattern:

>>> r = response.replace(cls=scrapy.http.response.html.HtmlResponse)
>>> print(''.join(r.xpath('//description/*').extract()))
<p>Meet the Women's Nike Air Max Thea Mid Running Shoes. She is lighter than ever, durable as ever, and as comfortable as ever--and now, she comes in a mid-top silhouette. She is everything you could want in a running shoe and now she's better than ever with the addition of a seamless, molded leather upper. </p><p> In addition to the molded leather upper, the lelastic at the ankle provides a secure fit that is easy to put on and take off. The midsole features injected Phylon for a great cushiony feel and a visible Air-Sole unit to absorb shock, providing a forgiving, easeful feel in every foot strike. She'll go easy on you, but don't feel like you have to reciprocate. </p><p>FEATURES:</p><ul><li>UPPER: Molded leather </li> <li>MIDSOLE: Injected Phylon with Air-Sole unit </li><li>  </li><li>OUTSOLE: Rubber </li> IMPORTED</ul>

light4 commented, Feb 7, 2017

Thanks. That’s better.
