Unicode Link Extractor

When using the following to extract all of the links from a response:

self.link_extractor = LinkExtractor()
...
links = self.link_extractor.extract_links(response)

On rare occasions, the following error is thrown:

2016-05-25 12:13:55,432 [root] [ERROR]  Error on http://detroit.curbed.com/2016/5/5/11605132/tiny-house-designer-show, traceback: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1203, in mainLoop
    self.runUntilCurrent()
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 393, in callback
    self._startRunCallbacks(result)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/var/www/html/DomainCrawler/DomainCrawler/spiders/hybrid_spider.py", line 223, in parse
    items.extend(self._extract_requests(response))
  File "/var/www/html/DomainCrawler/DomainCrawler/spiders/hybrid_spider.py", line 477, in _extract_requests
    links = self.link_extractor.extract_links(response)
  File "/usr/local/lib/python2.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 111, in extract_links
    all_links.extend(self._process_links(links))
  File "/usr/local/lib/python2.7/site-packages/scrapy/linkextractors/__init__.py", line 103, in _process_links
    link.url = canonicalize_url(urlparse(link.url))
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/url.py", line 85, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/url.py", line 46, in _safe_ParseResult
    to_native_str(parts.netloc.encode('idna')),
  File "/usr/local/lib/python2.7/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/usr/local/lib/python2.7/encodings/idna.py", line 73, in ToASCII
    raise UnicodeError("label empty or too long")
exceptions.UnicodeError: label empty or too long

I was able to find some information about the error here. My question is: what is the best way to handle this? Even if there is one bad link in the response, I’d want all of the other good links to be extracted.
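
One possible stopgap, sketched below, is to canonicalize each link yourself and drop only the ones that fail. This is not an official Scrapy API: SafeLinkExtractor is a hypothetical name, and the override relies on the _process_links hook visible in the traceback above.

# Hypothetical workaround: skip links whose URLs fail IDNA encoding,
# instead of letting one bad URL abort the whole extraction.
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.url import canonicalize_url


class SafeLinkExtractor(LinkExtractor):

    def _process_links(self, links):
        good_links = []
        for link in links:
            try:
                # Trigger the same canonicalization that raises
                # UnicodeError in the traceback above.
                canonicalize_url(link.url)
            except UnicodeError:
                continue  # drop just this link, keep the rest
            good_links.append(link)
        # Let the stock implementation filter, canonicalize and
        # deduplicate the surviving links as usual.
        return super(SafeLinkExtractor, self)._process_links(good_links)

Using self.link_extractor = SafeLinkExtractor() in place of LinkExtractor() would then keep the good links even when one URL has an empty or over-long domain label.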

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
kmike commented, Jun 6, 2016

I’m adding 1.1.1 milestone because this is a Scrapy 1.1 regression.

0 reactions
redapple commented, Jun 6, 2016

I believe we should fix it at the canonicalize_url level, something like catching the exception and returning the URL as-is if encoding the domain name using the IDNA algorithm fails.

It’s a shame there’s no explicit exception for wrong label lengths (we could test the exception string, but that feels hacky).

>>> from scrapy.utils.url import canonicalize_url
>>> canonicalize_url('http://www.'+'a'*63+'.com')
'http://www.aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.com/'
>>> canonicalize_url('http://www.'+'a'*64+'.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "scrapy/utils/url.py", line 85, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "scrapy/utils/url.py", line 46, in _safe_ParseResult
    to_native_str(parts.netloc.encode('idna')),
  File "/home/paul/.virtualenvs/scrapydev/lib/python2.7/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/home/paul/.virtualenvs/scrapydev/lib/python2.7/encodings/idna.py", line 73, in ToASCII
    raise UnicodeError("label empty or too long")
UnicodeError: label empty or too long
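
As a standalone illustration of that idea (safe_idna_encode is a hypothetical helper name, not the patch that was ultimately merged):

# Sketch of the fallback behaviour described above: keep the netloc
# unchanged when the idna codec rejects it.
def safe_idna_encode(netloc):
    try:
        return netloc.encode('idna')
    except UnicodeError:
        # "label empty or too long" -- return the host as-is
        return netloc


print(safe_idna_encode(u'www.' + u'a' * 63 + u'.com'))  # encodes via idna
print(safe_idna_encode(u'www.' + u'a' * 64 + u'.com'))  # returned as-is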
