
Scrapy fails to crawl emoji domains, raises idna.core.InvalidCodepoint

See original GitHub issue

Description

Scrapy fails to crawl emoji domains. Specifically, i❤.ws

Raises the following:

idna.core.InvalidCodepoint: Codepoint U+2764 at position 2 of 'i❤' not allowed
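The failure can be reproduced outside Scrapy. A minimal sketch using only the standard library's `idna` codec, which implements the older IDNA 2003 rules and still accepts emoji labels; the third-party `idna` package that Twisted calls implements IDNA 2008 instead, and rejects U+2764 with the `InvalidCodepoint` shown above:

```python
# Minimal reproduction sketch, outside Scrapy. The stdlib "idna" codec
# implements IDNA 2003, which still permits symbol codepoints such as
# U+2764, so the emoji hostname encodes to Punycode without error.
hostname = "i❤.ws"

ascii_form = hostname.encode("idna")
print(ascii_form)  # b'xn--i-7iq.ws'

# The A-label round-trips back to the emoji form under IDNA 2003:
print(ascii_form.decode("idna"))  # i❤.ws
```

Note that the traceback shows Twisted re-encoding the hostname via the IDNA 2008 package even when the spider is given the already-encoded `xn--i-7iq.ws` form, which is why pre-encoding the URLs does not avoid the crash.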

Steps to Reproduce

  1. Create a CrawlSpider with allowed_domains including an emoji domain, and start_urls including an emoji domain.
  2. Begin crawl with scrapy crawl

Expected behavior: Crawl site just like any other domain; do not raise an exception.

Actual behavior: Immediately raises exception.

Reproduces how often: Fails every time in my environment.

Versions

Scrapy       : 1.8.0
lxml         : 4.4.2.0
libxml2      : 2.9.9
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 19.10.0
Python       : 3.8.0 (default, Dec  6 2019, 10:12:02) - [GCC 7.4.0]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
cryptography : 2.8
Platform     : Linux-4.15.0-74-generic-x86_64-with-glibc2.27

Additional context

I used the following CrawlSpider:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class EmojiCrawlSpider(CrawlSpider):
    name = "emoji_test"
    allowed_domains = ['xn--i-7iq.ws']

    start_urls = ['https://xn--i-7iq.ws/']

    rules = (
        Rule(
            LinkExtractor(),
            callback="parse_items",
            follow=True
        ),
    )

    def parse_items(self, response):
        print(response.url)

Traceback:

Traceback (most recent call last):
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/python/log.py", line 103, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/python/log.py", line 86, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
    why = selectable.doRead()
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/internet/tcp.py", line 243, in doRead
    return self._dataReceived(data)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/internet/tcp.py", line 249, in _dataReceived
    rval = self.protocol.dataReceived(data)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/internet/endpoints.py", line 132, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 148, in processProxyResponse
    sslOptions = self._contextFactory.creatorForNetloc(self._tunneledHost, self._tunneledPort)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/scrapy/core/downloader/contextfactory.py", line 61, in creatorForNetloc
    return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext(),
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/scrapy/core/downloader/tls.py", line 50, in __init__
    super(ScrapyClientTLSOptions, self).__init__(hostname, ctx)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/internet/_sslverify.py", line 1174, in __init__
    self._hostnameBytes = _idnaBytes(hostname)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/twisted/internet/_idna.py", line 30, in _idnaBytes
    return idna.encode(text)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/idna/core.py", line 358, in encode
    s = alabel(label)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/idna/core.py", line 270, in alabel
    ulabel(label)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/idna/core.py", line 308, in ulabel
    check_label(label)
  File "/home/john/Code/venv/scrapy/lib/python3.8/site-packages/idna/core.py", line 261, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+2764 at position 2 of 'i❤' not allowed

Possibly related to https://github.com/scrapy/scrapy/issues/3321, but here the domain is valid and a different exception is raised.

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 13 (5 by maintainers)

Top GitHub Comments

2 reactions
Gallaecio commented, Feb 21, 2020

i❤.ws works in a web browser. I think it should work in Scrapy as well.

1 reaction
kjd commented, Nov 16, 2022

I’d caution that the main reason the IDNA standard was revised was to remove support for symbols like emoji, because of their security risks. Anyone who wants to perpetuate support for them should have to explicitly opt in to accept those risks, and should know what they are doing. Most registries ban registration of emoji domains, and they are being phased out overall. Substituting an older IDNA 2003 implementation will also be incompatible for some domains, since some names convert differently under the current version.
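The incompatibility kjd describes can be illustrated with the classic example of faß.de, which converts differently under the two standards. A sketch using only the standard library, which implements IDNA 2003; the IDNA 2008 result is noted in a comment:

```python
# Sketch of the IDNA 2003 / IDNA 2008 divergence described above.
# The stdlib codec (IDNA 2003) case-folds 'ß' to 'ss' during nameprep,
# so the name collapses to a plain ASCII domain:
print("faß.de".encode("idna"))  # b'fass.de'

# Under IDNA 2008 (e.g. the third-party `idna` package), 'ß' is a valid
# codepoint in its own right, so the same input yields a different,
# distinct A-label: b'xn--fa-hia.de'. Swapping implementations can
# therefore change which host a given name resolves to.
```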

