`LxmlLinkExtractor` fails handling unicode netlocs in Python2
Affected version: dc1f9ad
Affected Python version: Python 2 only
Steps to reproduce:
>>> import scrapy.http
>>> response = scrapy.http.TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8')
>>> response.css('a::attr(href)').extract()
[u'http://foo\u263a']
>>> import scrapy.linkextractors
>>> extractor = scrapy.linkextractors.LinkExtractor()
>>> extractor.extract_links(response)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/virtualenv/src/scrapy/scrapy/linkextractors/lxmlhtml.py", line 111, in extract_links
    all_links.extend(self._process_links(links))
  File "/tmp/virtualenv/src/scrapy/scrapy/linkextractors/__init__.py", line 104, in _process_links
    link.url = canonicalize_url(urlparse(link.url))
  File "/tmp/virtualenv/lib/python2.7/site-packages/w3lib/url.py", line 354, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "/tmp/virtualenv/lib/python2.7/site-packages/w3lib/url.py", line 298, in _safe_ParseResult
    netloc = parts.netloc.encode('idna')
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 76, in ToASCII
    label = nameprep(label)
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 21, in nameprep
    newlabel.append(stringprep.map_table_b2(c))
  File "/usr/lib64/python2.7/stringprep.py", line 197, in map_table_b2
    b = unicodedata.normalize("NFKC", al)
TypeError: normalize() argument 2 must be unicode, not str
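The failure is specific to Python 2's str/unicode split: the `idna` codec's `nameprep` step calls `unicodedata.normalize`, which requires a `unicode` argument, but the netloc reaches it as a byte string. On Python 3, where `str` is always text, the very same IDNA encoding succeeds. A minimal Python 3 sketch of the operation that blows up in the traceback above:

```python
# On Python 3, str is always text, so the idna codec's nameprep step works.
# Under Python 2, the equivalent call on a byte-string netloc raises the
# "normalize() argument 2 must be unicode, not str" TypeError shown above.
netloc = 'foo\u263a'                 # the non-ASCII host from the repro
ascii_netloc = netloc.encode('idna') # Punycode: an ASCII b'xn--foo-...' label
print(ascii_netloc.decode('ascii'))
```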
Issue Analytics
- Created: 7 years ago
- Comments: 7 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I tested scrapy 1.2.0 with https://github.com/scrapy/w3lib/pull/75 (commit https://github.com/scrapy/w3lib/pull/75/commits/10865d916b74f26e4eb59f60a4bc11b88b89d674), and it fixes the issue.
Working since (at least) w3lib 1.17.0 and scrapy 1.4.0, on both Python 2 and Python 3.
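For environments stuck on the affected versions, one possible workaround is to IDNA-encode the host yourself before the URL ever reaches `canonicalize_url`. This is only a sketch under assumptions, not part of the scrapy or w3lib API; `ascii_safe_url` is a hypothetical helper, shown here in Python 3 syntax:

```python
from urllib.parse import urlsplit, urlunsplit

def ascii_safe_url(url):
    """Hypothetical workaround sketch (not a scrapy/w3lib API): IDNA-encode
    the host so downstream canonicalization never sees a non-ASCII netloc.
    Simplification: any userinfo (user:pass@) in the netloc is dropped."""
    parts = urlsplit(url)
    host = parts.hostname or ''
    # The idna codec turns e.g. 'foo\u263a' into an ASCII 'xn--...' label.
    netloc = host.encode('idna').decode('ascii') if host else parts.netloc
    if parts.port:
        netloc += ':%d' % parts.port
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))
```

Pre-normalizing URLs this way sidesteps the Python 2 codepath in `w3lib.url._safe_ParseResult`, since the netloc it receives is already pure ASCII.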