Unicode Link Extractor
When using the following to extract all of the links from a response:
```python
self.link_extractor = LinkExtractor()
...
links = self.link_extractor.extract_links(response)
```
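For context, a minimal self-contained spider around this pattern might look like the sketch below. The spider and module names echo the traceback further down; the start URL and callback wiring are hypothetical illustration, not the reporter's actual code:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor


class HybridSpider(scrapy.Spider):
    name = "hybrid_spider"
    start_urls = ["http://detroit.curbed.com/"]

    def __init__(self, *args, **kwargs):
        super(HybridSpider, self).__init__(*args, **kwargs)
        self.link_extractor = LinkExtractor()

    def parse(self, response):
        # extract_links() canonicalizes every URL it finds, so a single
        # malformed domain on the page can abort the whole call.
        for link in self.link_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)
```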
On rare occasions, the following error is thrown:
```
2016-05-25 12:13:55,432 [root] [ERROR] Error on http://detroit.curbed.com/2016/5/5/11605132/tiny-house-designer-show, traceback: Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1203, in mainLoop
self.runUntilCurrent()
File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 825, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 393, in callback
self._startRunCallbacks(result)
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/var/www/html/DomainCrawler/DomainCrawler/spiders/hybrid_spider.py", line 223, in parse
items.extend(self._extract_requests(response))
File "/var/www/html/DomainCrawler/DomainCrawler/spiders/hybrid_spider.py", line 477, in _extract_requests
links = self.link_extractor.extract_links(response)
File "/usr/local/lib/python2.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 111, in extract_links
all_links.extend(self._process_links(links))
File "/usr/local/lib/python2.7/site-packages/scrapy/linkextractors/__init__.py", line 103, in _process_links
link.url = canonicalize_url(urlparse(link.url))
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/url.py", line 85, in canonicalize_url
parse_url(url), encoding=encoding)
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/url.py", line 46, in _safe_ParseResult
to_native_str(parts.netloc.encode('idna')),
File "/usr/local/lib/python2.7/encodings/idna.py", line 164, in encode
result.append(ToASCII(label))
File "/usr/local/lib/python2.7/encodings/idna.py", line 73, in ToASCII
raise UnicodeError("label empty or too long")
exceptions.UnicodeError: label empty or too long
```
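For reference, the failure can apparently be reproduced directly with Python's built-in idna codec whenever a hostname contains an empty label, for example two consecutive dots (an assumed minimal reproduction, not taken from the issue itself):

```python
# Assumed minimal reproduction: the stdlib idna codec raises the same
# "label empty or too long" UnicodeError that canonicalize_url surfaces above.
netloc = u"example..com"  # the empty label sits between the two dots
try:
    netloc.encode("idna")
except UnicodeError as exc:
    print(exc)  # label empty or too long
```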
I was able to find some information about the error here. My question is: what is the best way to handle this? Even if there is one bad link in the response, I'd still want all of the other, good links to be extracted.
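Until this is fixed upstream, one possible spider-level workaround is to canonicalize links one at a time and skip only the ones that fail. This is a sketch only: it relies on the internal `_process_links` hook visible in the traceback, and `SafeLinkExtractor` is a name introduced here for illustration:

```python
from scrapy.linkextractors import LinkExtractor


class SafeLinkExtractor(LinkExtractor):
    """Hypothetical subclass: drop only the links that fail to
    canonicalize instead of letting one bad URL abort the whole page."""

    def _process_links(self, links):
        safe_links = []
        for link in links:
            try:
                # Canonicalize each link in isolation so a UnicodeError from
                # one unencodable domain does not discard the good links.
                safe_links.extend(
                    super(SafeLinkExtractor, self)._process_links([link]))
            except UnicodeError:
                continue
        return safe_links
```

Swapping `LinkExtractor()` for `SafeLinkExtractor()` in the snippet above then keeps the good links even when one URL has a domain the IDNA codec rejects.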
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I’m adding 1.1.1 milestone because this is a Scrapy 1.1 regression.
I believe we should fix it at the `canonicalize_url` level, something like catching the exception and returning the URL as-is if encoding the domain name with the IDNA algorithm fails. It's a shame there's no explicit exception for wrong label lengths (we can test the exception string, but that feels hacky).
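In code, that suggestion amounts to something like the sketch below. This is hypothetical, not the actual patch that landed in Scrapy, and `_safe_netloc` is an illustrative helper name:

```python
def _safe_netloc(netloc):
    # Prefer the proper IDNA encoding of the domain name.
    try:
        return netloc.encode('idna')
    except UnicodeError:
        # There is no dedicated exception class for bad label lengths, so we
        # catch the generic UnicodeError and keep the netloc as UTF-8 bytes
        # instead of failing the whole URL canonicalization.
        return netloc.encode('utf-8')
```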