Can't get the value of an attribute whose name starts with #?


For example, the page source is as follows: <img src="http://www1.pcbaby.com.cn/images/blank.gif" #src="http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg" />

But in Scrapy I get this: <img src="http://www1.pcbaby.com.cn/images/blank.gif" />

I can’t get the “#src” attribute in Scrapy.

How can I get it? Please help. Thanks!

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
redapple commented, Mar 15, 2016

lxml/libxml2 has trouble with this input, but html5lib and BeautifulSoup can cope, although each needs a slightly different XPath expression.

Plain lxml:

$ ipython
Python 2.7.9 (default, Apr  2 2015, 15:33:21) 

In [1]: t = '''<html><body><img src="http://www1.pcbaby.com.cn/images/blank.gif" #src="http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg" /></body></html>'''

In [2]: import lxml.html

In [3]: lxml_doc = lxml.html.fromstring(t)

In [4]: lxml_doc.xpath('//img/@*')
Out[4]: ['http://www1.pcbaby.com.cn/images/blank.gif']

Let’s try with html5lib

In [6]: from lxml.html import tostring, html5parser

In [7]: html5_doc = html5parser.document_fromstring(t)
/home/paul/.virtualenvs/scrapy10/local/lib/python2.7/site-packages/html5lib/ihatexml.py:254: DataLossWarning: Coercing non-XML name
  warnings.warn("Coercing non-XML name", DataLossWarning)

In [8]: html5_doc.xpath('//img/@*')
Out[8]: []

In [9]: html5_doc.xpath('//@*')
Out[9]: 
['http://www1.pcbaby.com.cn/images/blank.gif',
 'http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg']

In [15]: tostring(html5_doc)
Out[15]: '<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head></html:head><html:body><html:img src="http://www1.pcbaby.com.cn/images/blank.gif" U00023src="http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg"></html:img></html:body></html:html>'

It seems html5lib exposes the attribute as “U00023src”, not “#src”, and you need to pass the namespace to XPath:

In [16]: html5_doc.xpath('//html:img/@*', namespaces={"html": "http://www.w3.org/1999/xhtml"})
Out[16]: 
['http://www1.pcbaby.com.cn/images/blank.gif',
 'http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg']

In [17]: html5_doc.xpath('//html:img/@U00023src', namespaces={"html": "http://www.w3.org/1999/xhtml"})
Out[17]: ['http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg']
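
The “U00023src” name looks like the “#” character replaced by “U” plus its code point as five hex digits (“#” is U+0023), which matches the “Coercing non-XML name” warning above. A minimal sketch of that renaming, assuming a simplified ASCII-only rule for which characters are allowed; `coerce_to_xml_name` is a hypothetical helper written for illustration, not html5lib's own function:

```python
import string

# Simplified, ASCII-only notion of which characters an XML name may contain.
# Real XML names allow many more characters; this is an assumption made to
# keep the sketch short.
NAME_START = set(string.ascii_letters + "_")
NAME_REST = NAME_START | set(string.digits + "-.")


def coerce_to_xml_name(name):
    """Hypothetical helper: replace disallowed characters with 'U' + 5 hex digits."""
    out = []
    for i, ch in enumerate(name):
        allowed = NAME_START if i == 0 else NAME_REST
        out.append(ch if ch in allowed else "U%05X" % ord(ch))
    return "".join(out)


print(coerce_to_xml_name("#src"))  # U00023src
```

This can help when guessing the coerced name of other unusual attributes before writing the XPath expression, but the result should be checked against the `tostring()` output.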

Let’s try with the BeautifulSoup parser:

In [18]: from lxml.html import soupparser

In [19]: bs_doc = soupparser.fromstring(t)
/home/paul/.virtualenvs/scrapy10/local/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))

In [20]: bs_doc.xpath('//img/@*')
Out[20]: 
['http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg',
 'http://www1.pcbaby.com.cn/images/blank.gif']

In [21]: bs_doc.xpath('//img/@#src')
---------------------------------------------------------------------------
XPathEvalError                            Traceback (most recent call last)
<ipython-input-21-2d48dd3d4138> in <module>()
----> 1 bs_doc.xpath('//img/@#src')

src/lxml/lxml.etree.pyx in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:61854)()

src/lxml/xpath.pxi in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:178516)()

src/lxml/xpath.pxi in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:177421)()

XPathEvalError: Invalid expression

In [22]: bs_doc.xpath('//img/@*[name()="#src"]')
Out[22]: ['http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg']
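
The warning above shows BeautifulSoup falling back to Python's built-in "html.parser", so the standard library alone should also keep the “#src” attribute. A minimal sketch under that assumption, written for Python 3 (the session above is Python 2.7); `ImgAttrCollector` is a hypothetical name introduced here:

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Hypothetical helper: collect the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag == "img":
            self.images.append(dict(attrs))


html = (
    '<html><body><img src="http://www1.pcbaby.com.cn/images/blank.gif" '
    '#src="http://img0.pcbaby.com.cn/pcbaby/1603/10/2799244_yunzaoqi0310-11.jpg" />'
    '</body></html>'
)

collector = ImgAttrCollector()
collector.feed(html)
collector.close()

for attrs in collector.images:
    print(attrs.get("#src"))
```

In a Scrapy callback, the same collector could be fed `response.text` instead of the literal string, leaving the rest of the page to the usual selectors.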

0 reactions
cage1618 commented, Mar 16, 2016

ok, thank you very much!
