Scrapy fails to crawl emoji domains, raises idna.core.InvalidCodepoint
Description
Scrapy fails to crawl emoji domains. Specifically, i❤.ws
Raises the following:
idna.core.InvalidCodepoint: Codepoint U+2764 at position 2 of 'i❤' not allowed
Steps to Reproduce
- Create a CrawlSpider with allowed_domains including an emoji domain, and start_urls including an emoji domain.
- Begin crawl with scrapy crawl
Expected behavior: Crawl site just like any other domain; do not raise an exception.
Actual behavior: Immediately raises exception.
Reproduces how often: Fails every time in my environment.
Versions
Scrapy : 1.8.0
lxml : 4.4.2.0
libxml2 : 2.9.9
cssselect : 1.1.0
parsel : 1.5.2
w3lib : 1.21.0
Twisted : 19.10.0
Python : 3.8.0 (default, Dec 6 2019, 10:12:02) - [GCC 7.4.0]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019)
cryptography : 2.8
Platform : Linux-4.15.0-74-generic-x86_64-with-glibc2.27
Additional context
I used the following CrawlSpider:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class EmojiCrawlSpider(CrawlSpider):
        name = "emoji_test"
        allowed_domains = ['xn--i-7iq.ws']
        start_urls = ['https://xn--i-7iq.ws/']
        rules = (
            Rule(
                LinkExtractor(),
                callback="parse_items",
                follow=True
            ),
        )

        def parse_items(self, response):
            print(response.url)

Traceback:
Traceback (most recent call last):
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/python/log.py", line 103, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/python/log.py", line 86, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
    why = selectable.doRead()
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/internet/tcp.py", line 243, in doRead
    return self._dataReceived(data)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/internet/tcp.py", line 249, in _dataReceived
    rval = self.protocol.dataReceived(data)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/internet/endpoints.py", line 132, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 148, in processProxyResponse
    sslOptions = self._contextFactory.creatorForNetloc(self._tunneledHost, self._tunneledPort)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/scrapy/core/downloader/contextfactory.py", line 61, in creatorForNetloc
    return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext(),
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/scrapy/core/downloader/tls.py", line 50, in __init__
    super(ScrapyClientTLSOptions, self).__init__(hostname, ctx)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/internet/_sslverify.py", line 1174, in __init__
    self._hostnameBytes = _idnaBytes(hostname)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/internet/_idna.py", line 30, in _idnaBytes
    return idna.encode(text)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/idna/core.py", line 358, in encode
    s = alabel(label)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/idna/core.py", line 270, in alabel
    ulabel(label)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/idna/core.py", line 308, in ulabel
    check_label(label)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/idna/core.py", line 261, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+2764 at position 2 of 'i❤' not allowed

Possibly related to https://github.com/scrapy/scrapy/issues/3321, but domain is valid and raises different Exception.
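
Judging by the last frames of the traceback, the failure seems to come from the third-party idna package rather than from Scrapy's own code, so it can probably be reproduced without running a crawl. The sketch below is only an illustration under that assumption: the helper name idna2008_result is made up for this example, and the import is guarded in case idna is not installed.

```python
# Minimal sketch: call idna.encode() directly on the Punycode hostname,
# which is what Twisted's _idnaBytes does in the traceback above.
try:
    import idna  # third-party IDNA 2008 package used by Twisted
except ImportError:  # keep the sketch runnable without it
    idna = None


def idna2008_result(hostname):
    """Return the encoded hostname, or the name of the IDNA error raised.

    Hypothetical helper for illustration; not part of Scrapy or Twisted.
    """
    if idna is None:
        return "idna not installed"
    try:
        return idna.encode(hostname).decode("ascii")
    except idna.core.IDNAError as exc:
        # InvalidCodepoint is a subclass of IDNAError
        return type(exc).__name__


print(idna2008_result("xn--i-7iq.ws"))
```

If this prints InvalidCodepoint, the label is being rejected while idna validates the decoded form of xn--i-7iq, before any TLS handshake takes place.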
Issue Analytics
- Created 4 years ago
- Comments: 13 (5 by maintainers)
Top GitHub Comments
i❤.ws works in a web browser. I think it should work in Scrapy as well.

I'd caution that the main reason the IDNA standard was revised was to remove support for symbols like emoji due to their security risks. Anyone who wants to perpetuate support for them should explicitly opt in, accepting these security risks and knowing what they are doing. Most registries ban registrations of emoji domains, and they are being phased out overall. Substituting the older IDNA 2003 implementation will also be incompatible for some domains, because some domains convert differently under the current version.
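
To make the IDNA 2003 versus IDNA 2008 difference concrete, here is a small sketch using Python's built-in "idna" codec, which the Python documentation describes as implementing RFC 3490 (IDNA 2003). It is shown only to illustrate why the older rules accept this label, not as a recommended fix for Scrapy, and the helper name to_ascii_idna2003 is an assumption made for this example.

```python
# Sketch: the standard library "idna" codec follows the older IDNA 2003
# rules, which do not reject U+2764 (the heart symbol in the domain).

def to_ascii_idna2003(hostname: str) -> bytes:
    """Encode a Unicode hostname with the stdlib IDNA 2003 codec.

    Hypothetical helper for illustration only.
    """
    return hostname.encode("idna")


# "\u2764" is the heart character from the domain in this issue.
encoded = to_ascii_idna2003("i\u2764.ws")
print(encoded)

# Decoding goes back to the Unicode form.
print(encoded.decode("idna") == "i\u2764.ws")
```

The encoded form matches the xn--i-7iq.ws hostname used in the spider above. As the comment notes, relying on IDNA 2003 behaviour means accepting the security trade-offs that led to the stricter revision, so any such fallback would need to be an explicit opt-in.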