
Can't get the value of an attribute whose name starts with #?

See original GitHub issue

For example, the page source is below: <img src="http://www1.pcbaby.com.cn/images/blank.gif" #src="http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg" />

But this is what I get in Scrapy: <img src="http://www1.pcbaby.com.cn/images/blank.gif" />

I can’t get the “#src” attribute in Scrapy.

How can I get it? Please help. I look forward to your reply. Thanks!

Issue Analytics

State:
Created 8 years ago
Comments:8 (6 by maintainers)

Top GitHub Comments

1 reaction

redapple commented, Mar 15, 2016

lxml/libxml2 drops the “#src” attribute from this input, but html5lib and BeautifulSoup can cope, although each needs some care in the XPath expression you use.

Plain lxml:

$ ipython
Python 2.7.9 (default, Apr  2 2015, 15:33:21) 

In [1]: t = '''<html><body><img src="http://www1.pcbaby.com.cn/images/blank.gif" #src="http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg" /></body></html>'''

In [2]: import lxml.html

In [3]: lxml_doc = lxml.html.fromstring(t)

In [4]: lxml_doc.xpath('//img/@*')
Out[4]: ['http://www1.pcbaby.com.cn/images/blank.gif']

Let’s try with html5lib

In [6]: from lxml.html import tostring, html5parser

In [7]: html5_doc = html5parser.document_fromstring(t)
/home/paul/.virtualenvs/scrapy10/local/lib/python2.7/site-packages/html5lib/ihatexml.py:254: DataLossWarning: Coercing non-XML name
  warnings.warn("Coercing non-XML name", DataLossWarning)

In [8]: html5_doc.xpath('//img/@*')
Out[8]: []

In [9]: html5_doc.xpath('//@*')
Out[9]: 
['http://www1.pcbaby.com.cn/images/blank.gif',
 'http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg']

In [15]: tostring(html5_doc)
Out[15]: '<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head></html:head><html:body><html:img src="http://www1.pcbaby.com.cn/images/blank.gif" U00023src="http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg"></html:img></html:body></html:html>'

It seems html5lib exposes the attribute as “U00023src”, not “#src”. You also need to pass the XHTML namespace to the XPath call:

In [16]: html5_doc.xpath('//html:img/@*', namespaces={"html": "http://www.w3.org/1999/xhtml"})
Out[16]: 
['http://www1.pcbaby.com.cn/images/blank.gif',
 'http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg']

In [17]: html5_doc.xpath('//html:img/@U00023src', namespaces={"html": "http://www.w3.org/1999/xhtml"})
Out[17]: ['http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg']
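If you go the html5lib route, you may want to map the coerced names back to the originals. Here is a minimal sketch, assuming the coercion always follows the pattern visible in the output above ("U" followed by five uppercase hex digits for each illegal character, so "#" becomes "U00023"). The helper name decode_coerced_name is hypothetical, not part of html5lib or lxml.

```python
import re

# Assumption: an illegal character is written as "U" plus five uppercase
# hex digits of its code point, e.g. "#" (0x23) -> "U00023".
_COERCED = re.compile(r"U([0-9A-F]{5})")


def decode_coerced_name(name):
    """Hypothetical helper: turn a coerced name such as 'U00023src' back into '#src'."""
    return _COERCED.sub(lambda m: chr(int(m.group(1), 16)), name)


print(decode_coerced_name("U00023src"))
```

Note that a genuine attribute name that happens to contain "U" followed by five hex digits would also be rewritten, so apply this only to names you know were coerced.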

Let’s try with beautifulsoup parser

In [18]: from lxml.html import soupparser

In [19]: bs_doc = soupparser.fromstring(t)
/home/paul/.virtualenvs/scrapy10/local/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))

In [20]: bs_doc.xpath('//img/@*')
Out[20]: 
['http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg',
 'http://www1.pcbaby.com.cn/images/blank.gif']

In [21]: bs_doc.xpath('//img/@#src')
---------------------------------------------------------------------------
XPathEvalError                            Traceback (most recent call last)
<ipython-input-21-2d48dd3d4138> in <module>()
----> 1 bs_doc.xpath('//img/@#src')

src/lxml/lxml.etree.pyx in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:61854)()

src/lxml/xpath.pxi in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:178516)()

src/lxml/xpath.pxi in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:177421)()

XPathEvalError: Invalid expression

In [22]: bs_doc.xpath('//img/@*[name()="#src"]')
Out[22]: ['http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg']
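The warning above shows that BeautifulSoup fell back to Python's built-in "html.parser" here, which suggests the attribute can also be read with the standard library alone. The following is a rough sketch under that assumption, written for Python 3. The names ImgAttrCollector and collect_img_attrs are made up for illustration and are not part of Scrapy, lxml, or BeautifulSoup.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are kept as written
        # in the markup (lowercased), so "#src" is available under that key.
        if tag == "img":
            self.images.append(dict(attrs))


def collect_img_attrs(markup):
    """Return one dict of attributes per <img> tag found in markup."""
    parser = ImgAttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.images


t = '''<html><body><img src="http://www1.pcbaby.com.cn/images/blank.gif" #src="http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg" /></body></html>'''

for attrs in collect_img_attrs(t):
    print(attrs.get("#src"))
```

In a Scrapy callback, the same function could be fed response.text. That trades XPath support for a parser that does not discard the non-standard attribute name.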

0 reactions

cage1618 commented, Mar 16, 2016

ok, thank you very much!
