
`LxmlLinkExtractor` fails handling unicode netlocs in Python2


Affected version: dc1f9ad

Affected Python version: Python 2 only

Steps to reproduce:

>>> import scrapy.http
>>> response = scrapy.http.TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8')
>>> response.css('a::attr(href)').extract()
[u'http://foo\u263a']
>>> import scrapy.linkextractors
>>> extractor = scrapy.linkextractors.LinkExtractor()
>>> extractor.extract_links(response)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/virtualenv/src/scrapy/scrapy/linkextractors/lxmlhtml.py", line 111, in extract_links
    all_links.extend(self._process_links(links))
  File "/tmp/virtualenv/src/scrapy/scrapy/linkextractors/__init__.py", line 104, in _process_links
    link.url = canonicalize_url(urlparse(link.url))
  File "/tmp/virtualenv/lib/python2.7/site-packages/w3lib/url.py", line 354, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "/tmp/virtualenv/lib/python2.7/site-packages/w3lib/url.py", line 298, in _safe_ParseResult
    netloc = parts.netloc.encode('idna')
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 76, in ToASCII
    label = nameprep(label)
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 21, in nameprep
    newlabel.append(stringprep.map_table_b2(c))
  File "/usr/lib64/python2.7/stringprep.py", line 197, in map_table_b2
    b = unicodedata.normalize("NFKC", al)
TypeError: normalize() argument 2 must be unicode, not str
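The traceback bottoms out in Python 2's idna codec: w3lib's `_safe_ParseResult` calls `parts.netloc.encode('idna')`, and when the netloc arrives as a byte `str` rather than `unicode`, nameprep ends up passing byte characters to `unicodedata.normalize`, which only accepts `unicode`. With a text netloc the same IDNA (2003) encoding succeeds. A minimal illustration (Python 3, where all strings are text):

```python
# IDNA (2003) encoding of a *text* hostname succeeds and yields punycode;
# the Python 2 failure above came from feeding the codec a byte string.
host = 'foo\u263a'              # 'foo☺'
encoded = host.encode('idna')
print(encoded)                  # b'xn--foo-4s5a'
```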

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
redapple commented, Oct 13, 2016

I tested scrapy 1.2.0 with https://github.com/scrapy/w3lib/pull/75 (https://github.com/scrapy/w3lib/pull/75/commits/10865d916b74f26e4eb59f60a4bc11b88b89d674) and it fixes the issue:

$ scrapy version -v
Scrapy    : 1.2.0
lxml      : 3.6.4.0
libxml2   : 2.9.4
Twisted   : 16.4.1
Python    : 2.7.12 (default, Jul  1 2016, 15:12:24) - [GCC 5.4.0 20160609]
pyOpenSSL : 16.1.0 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Linux-4.4.0-42-generic-x86_64-with-Ubuntu-16.04-xenial

$ pip freeze |grep w3lib
-e git+git@github.com:scrapy/w3lib.git@10865d916b74f26e4eb59f60a4bc11b88b89d674#egg=w3lib

$ scrapy shell
2016-10-13 10:59:02 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapybot)
>>> import scrapy.http
>>> response = scrapy.http.TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8')
>>> response.css('a::attr(href)').extract()
[u'http://foo\u263a']
>>> import scrapy.linkextractors
>>> extractor = scrapy.linkextractors.LinkExtractor()
>>> extractor.extract_links(response)
[Link(url='http://xn--foo-4s5a/', text=u'', fragment='', nofollow=False)]
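The `xn--foo-4s5a` netloc above is the punycode form produced once IDNA encoding no longer crashes. The defensive shape of the fix can be sketched with a hypothetical stdlib-only helper (this is an illustration of the approach, not the actual w3lib patch; it naively IDNA-encodes the whole netloc, whereas real code would first split out userinfo and port):

```python
from urllib.parse import urlsplit, urlunsplit

def idna_safe(url):
    """Hypothetical helper: IDNA-encode the netloc when possible;
    on failure, keep the original netloc instead of raising."""
    parts = urlsplit(url)
    try:
        netloc = parts.netloc.encode('idna').decode('ascii')
    except UnicodeError:
        netloc = parts.netloc  # leave non-encodable hosts unchanged
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))

print(idna_safe('http://foo\u263a'))   # http://xn--foo-4s5a
```

ASCII hostnames pass through unchanged, since the idna codec returns ASCII labels as-is.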
>>> 
0 reactions
elacuesta commented, Dec 24, 2019

Working since (at least) w3lib 1.17.0 and scrapy 1.4.0:

Python 3

scrapy version -v
Scrapy    : 1.4.0
lxml      : 4.4.2.0
libxml2   : 2.9.9
cssselect : 1.1.0
parsel    : 1.5.2
w3lib     : 1.17.0
Twisted   : 19.10.0
Python    : 3.6.9 (default, Nov  7 2019, 10:44:02) - [GCC 8.3.0]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
Platform  : Linux-4.15.0-20-generic-x86_64-with-LinuxMint-19.1-tessa
In [1]: from scrapy.http import TextResponse 
   ...: from scrapy.linkextractors import LinkExtractor 
   ...: response = TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8') 
   ...: response.css('a::attr(href)').extract() 
   ...: extractor = LinkExtractor() 
   ...: extractor.extract_links(response)                                                                                                                                                                                                     
Out[1]: [Link(url='http://foo☺', text='', fragment='', nofollow=False)]

Python 2

scrapy version -v
Scrapy    : 1.4.0
lxml      : 4.4.2.0
libxml2   : 2.9.9
cssselect : 1.1.0
parsel    : 1.5.2
w3lib     : 1.17.0
Twisted   : 19.10.0
Python    : 2.7.15+ (default, Oct  7 2019, 17:39:04) - [GCC 7.4.0]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
Platform  : Linux-4.15.0-20-generic-x86_64-with-LinuxMint-19.1-tessa
In [1]: from scrapy.http import TextResponse
   ...: from scrapy.linkextractors import LinkExtractor
   ...: response = TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8')
   ...: response.css('a::attr(href)').extract()
   ...: extractor = LinkExtractor()
   ...: extractor.extract_links(response)
   ...: 
Out[1]: [Link(url='http://foo\xe2\x98\xba', text=u'', fragment='', nofollow=False)]
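The Python 2 and Python 3 results above are the same URL; Python 2 simply displays the UTF-8 bytes of '☺' as escapes, while Python 3 prints the text form. A quick check of that equivalence:

```python
# The Py3 output 'http://foo☺' and the Py2 output 'http://foo\xe2\x98\xba'
# are the same URL: the latter is just its UTF-8 byte representation.
url_text = 'http://foo\u263a'
url_bytes = url_text.encode('utf8')
print(url_bytes)                # b'http://foo\xe2\x98\xba'
```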

