
`LxmlLinkExtractor` fails handling unicode netlocs in Python2


Affected version: dc1f9ad

Affected Python version: Python 2 only

Steps to reproduce:

>>> import scrapy.http
>>> response = scrapy.http.TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8')
>>> response.css('a::attr(href)').extract()
[u'http://foo\u263a']
>>> import scrapy.linkextractors
>>> extractor = scrapy.linkextractors.LinkExtractor()
>>> extractor.extract_links(response)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/virtualenv/src/scrapy/scrapy/linkextractors/lxmlhtml.py", line 111, in extract_links
    all_links.extend(self._process_links(links))
  File "/tmp/virtualenv/src/scrapy/scrapy/linkextractors/__init__.py", line 104, in _process_links
    link.url = canonicalize_url(urlparse(link.url))
  File "/tmp/virtualenv/lib/python2.7/site-packages/w3lib/url.py", line 354, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "/tmp/virtualenv/lib/python2.7/site-packages/w3lib/url.py", line 298, in _safe_ParseResult
    netloc = parts.netloc.encode('idna')
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 76, in ToASCII
    label = nameprep(label)
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 21, in nameprep
    newlabel.append(stringprep.map_table_b2(c))
  File "/usr/lib64/python2.7/stringprep.py", line 197, in map_table_b2
    b = unicodedata.normalize("NFKC", al)
TypeError: normalize() argument 2 must be unicode, not str
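The traceback bottoms out in Python 2's idna codec: w3lib's `_safe_ParseResult` calls `parts.netloc.encode('idna')`, and when the netloc arrives as a byte `str` rather than `unicode`, nameprep ends up passing byte characters to `unicodedata.normalize`, which only accepts `unicode`. With a text netloc the same IDNA (2003) encoding succeeds. A minimal illustration (Python 3, where all strings are text):

```python
# IDNA (2003) encoding of a *text* hostname succeeds and yields punycode;
# the Python 2 failure above came from feeding the codec a byte string.
host = 'foo\u263a'              # 'foo☺'
encoded = host.encode('idna')
print(encoded)                  # b'xn--foo-4s5a'
```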

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
redapple commented, Oct 13, 2016

I tested scrapy 1.2.0 with https://github.com/scrapy/w3lib/pull/75 (https://github.com/scrapy/w3lib/pull/75/commits/10865d916b74f26e4eb59f60a4bc11b88b89d674) and it fixes the issue:

$ scrapy version -v
Scrapy    : 1.2.0
lxml      : 3.6.4.0
libxml2   : 2.9.4
Twisted   : 16.4.1
Python    : 2.7.12 (default, Jul  1 2016, 15:12:24) - [GCC 5.4.0 20160609]
pyOpenSSL : 16.1.0 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Linux-4.4.0-42-generic-x86_64-with-Ubuntu-16.04-xenial

$ pip freeze |grep w3lib
-e git+git@github.com:scrapy/w3lib.git@10865d916b74f26e4eb59f60a4bc11b88b89d674#egg=w3lib

$ scrapy shell
2016-10-13 10:59:02 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapybot)
>>> import scrapy.http
>>> response = scrapy.http.TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8')
>>> response.css('a::attr(href)').extract()
[u'http://foo\u263a']
>>> import scrapy.linkextractors
>>> extractor = scrapy.linkextractors.LinkExtractor()
>>> extractor.extract_links(response)
[Link(url='http://xn--foo-4s5a/', text=u'', fragment='', nofollow=False)]
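The `xn--foo-4s5a` netloc above is the punycode form produced once IDNA encoding no longer crashes. The defensive shape of the fix can be sketched with a hypothetical stdlib-only helper (this is an illustration of the approach, not the actual w3lib patch; it naively IDNA-encodes the whole netloc, whereas real code would first split out userinfo and port):

```python
from urllib.parse import urlsplit, urlunsplit

def idna_safe(url):
    """Hypothetical helper: IDNA-encode the netloc when possible;
    on failure, keep the original netloc instead of raising."""
    parts = urlsplit(url)
    try:
        netloc = parts.netloc.encode('idna').decode('ascii')
    except UnicodeError:
        netloc = parts.netloc  # leave non-encodable hosts unchanged
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))

print(idna_safe('http://foo\u263a'))   # http://xn--foo-4s5a
```

ASCII hostnames pass through unchanged, since the idna codec returns ASCII labels as-is.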
>>> 
0 reactions
elacuesta commented, Dec 24, 2019

Working since (at least) w3lib 1.17.0 and scrapy 1.4.0:

Python 3

scrapy version -v
Scrapy    : 1.4.0
lxml      : 4.4.2.0
libxml2   : 2.9.9
cssselect : 1.1.0
parsel    : 1.5.2
w3lib     : 1.17.0
Twisted   : 19.10.0
Python    : 3.6.9 (default, Nov  7 2019, 10:44:02) - [GCC 8.3.0]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
Platform  : Linux-4.15.0-20-generic-x86_64-with-LinuxMint-19.1-tessa
In [1]: from scrapy.http import TextResponse 
   ...: from scrapy.linkextractors import LinkExtractor 
   ...: response = TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8') 
   ...: response.css('a::attr(href)').extract() 
   ...: extractor = LinkExtractor() 
   ...: extractor.extract_links(response)                                                                                                                                                                                                     
Out[1]: [Link(url='http://foo☺', text='', fragment='', nofollow=False)]

Python 2

scrapy version -v
Scrapy    : 1.4.0
lxml      : 4.4.2.0
libxml2   : 2.9.9
cssselect : 1.1.0
parsel    : 1.5.2
w3lib     : 1.17.0
Twisted   : 19.10.0
Python    : 2.7.15+ (default, Oct  7 2019, 17:39:04) - [GCC 7.4.0]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
Platform  : Linux-4.15.0-20-generic-x86_64-with-LinuxMint-19.1-tessa
In [1]: from scrapy.http import TextResponse
   ...: from scrapy.linkextractors import LinkExtractor
   ...: response = TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8')
   ...: response.css('a::attr(href)').extract()
   ...: extractor = LinkExtractor()
   ...: extractor.extract_links(response)
   ...: 
Out[1]: [Link(url='http://foo\xe2\x98\xba', text=u'', fragment='', nofollow=False)]
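The Python 2 and Python 3 results above are the same URL; Python 2 simply displays the UTF-8 bytes of '☺' as escapes, while Python 3 prints the text form. A quick check of that equivalence:

```python
# The Py3 output 'http://foo☺' and the Py2 output 'http://foo\xe2\x98\xba'
# are the same URL: the latter is just its UTF-8 byte representation.
url_text = 'http://foo\u263a'
url_bytes = url_text.encode('utf8')
print(url_bytes)                # b'http://foo\xe2\x98\xba'
```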

