Unicode Link Extractor

When using the following to extract all of the links from a response:

self.link_extractor = LinkExtractor()
...
links = self.link_extractor.extract_links(response)

On rare occasions, the following error is thrown:

2016-05-25 12:13:55,432 [root] [ERROR]  Error on http://detroit.curbed.com/2016/5/5/11605132/tiny-house-designer-show, traceback: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1203, in mainLoop
    self.runUntilCurrent()
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 393, in callback
    self._startRunCallbacks(result)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/var/www/html/DomainCrawler/DomainCrawler/spiders/hybrid_spider.py", line 223, in parse
    items.extend(self._extract_requests(response))
  File "/var/www/html/DomainCrawler/DomainCrawler/spiders/hybrid_spider.py", line 477, in _extract_requests
    links = self.link_extractor.extract_links(response)
  File "/usr/local/lib/python2.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 111, in extract_links
    all_links.extend(self._process_links(links))
  File "/usr/local/lib/python2.7/site-packages/scrapy/linkextractors/__init__.py", line 103, in _process_links
    link.url = canonicalize_url(urlparse(link.url))
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/url.py", line 85, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/url.py", line 46, in _safe_ParseResult
    to_native_str(parts.netloc.encode('idna')),
  File "/usr/local/lib/python2.7/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/usr/local/lib/python2.7/encodings/idna.py", line 73, in ToASCII
    raise UnicodeError("label empty or too long")
exceptions.UnicodeError: label empty or too long

I was able to find some information about the error here. My question is: what is the best way to handle this? Even if there is one bad link in the response, I’d want all of the other good links to be extracted.
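
One possible stopgap, sketched below, is to canonicalize each link yourself and drop only the ones that fail. This is not an official Scrapy API: SafeLinkExtractor is a hypothetical name, and the override relies on the _process_links hook visible in the traceback above.

# Hypothetical workaround: skip links whose URLs fail IDNA encoding,
# instead of letting one bad URL abort the whole extraction.
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.url import canonicalize_url


class SafeLinkExtractor(LinkExtractor):

    def _process_links(self, links):
        good_links = []
        for link in links:
            try:
                # Trigger the same canonicalization that raises
                # UnicodeError in the traceback above.
                canonicalize_url(link.url)
            except UnicodeError:
                continue  # drop just this link, keep the rest
            good_links.append(link)
        # Let the stock implementation filter, canonicalize and
        # deduplicate the surviving links as usual.
        return super(SafeLinkExtractor, self)._process_links(good_links)

Using self.link_extractor = SafeLinkExtractor() in place of LinkExtractor() would then keep the good links even when one URL has an empty or over-long domain label.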

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
kmike commented, Jun 6, 2016

I’m adding 1.1.1 milestone because this is a Scrapy 1.1 regression.

0 reactions
redapple commented, Jun 6, 2016

I believe we should fix it at the canonicalize_url level, something like catching the exception and returning the URL as-is if encoding the domain name using the IDNA algorithm fails.

It’s a shame there’s no explicit exception for wrong label lengths (we could test the exception string, but that feels hacky).

>>> from scrapy.utils.url import canonicalize_url
>>> canonicalize_url('http://www.'+'a'*63+'.com')
'http://www.aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.com/'
>>> canonicalize_url('http://www.'+'a'*64+'.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "scrapy/utils/url.py", line 85, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "scrapy/utils/url.py", line 46, in _safe_ParseResult
    to_native_str(parts.netloc.encode('idna')),
  File "/home/paul/.virtualenvs/scrapydev/lib/python2.7/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/home/paul/.virtualenvs/scrapydev/lib/python2.7/encodings/idna.py", line 73, in ToASCII
    raise UnicodeError("label empty or too long")
UnicodeError: label empty or too long
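
As a standalone illustration of that idea (safe_idna_encode is a hypothetical helper name, not the patch that was ultimately merged):

# Sketch of the fallback behaviour described above: keep the netloc
# unchanged when the idna codec rejects it.
def safe_idna_encode(netloc):
    try:
        return netloc.encode('idna')
    except UnicodeError:
        # "label empty or too long" -- return the host as-is
        return netloc


print(safe_idna_encode(u'www.' + u'a' * 63 + u'.com'))  # encodes via idna
print(safe_idna_encode(u'www.' + u'a' * 64 + u'.com'))  # returned as-is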
