
Scrapy fails to fetch request with invalid hostname

See original GitHub issue

I have a URL with an invalid hostname: it does not match the IDNA standard, and Scrapy fails on it.

scrapy fetch "https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22]"

2018-07-06 11:53:09 [scrapy.core.scraper] ERROR: Error downloading <GET https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22]>
Traceback (most recent call last):
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/pawel/scrapy/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/home/pawel/scrapy/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/home/pawel/scrapy/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
    return handler.download_request(request, spider)
  File "/home/pawel/scrapy/scrapy/core/downloader/handlers/http11.py", line 67, in download_request
    return agent.download_request(request)
  File "/home/pawel/scrapy/scrapy/core/downloader/handlers/http11.py", line 331, in download_request
    method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/web/client.py", line 1649, in request
    endpoint = self._getEndpoint(parsedURI)
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/web/client.py", line 1633, in _getEndpoint
    return self._endpointFactory.endpointForURI(uri)
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/web/client.py", line 1510, in endpointForURI
    uri.port)
  File "/home/pawel/scrapy/scrapy/core/downloader/contextfactory.py", line 59, in creatorForNetloc
    return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext())
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1152, in __init__
    self._hostnameBytes = _idnaBytes(hostname)
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/internet/_idna.py", line 30, in _idnaBytes
    return idna.encode(text)
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/idna/core.py", line 355, in encode
    result.append(alabel(label))
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/idna/core.py", line 265, in alabel
    raise IDNAError('The label {0} is not a valid A-label'.format(label))
IDNAError: The label mediaworld_it_api2 is not a valid A-label

The IDNA error is legitimate. This URL, https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22], is not valid according to the IDNA standard.
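Concretely, hostname labels are restricted by the RFC 1123 "LDH" (letters-digits-hyphen) rule, which an underscore violates. A minimal stdlib sketch of that check (the LDH_LABEL pattern and is_ldh_label helper below are illustrative, not code from Scrapy or the idna package):

```python
import re

# RFC 1123 "LDH" rule: a label contains only letters, digits, and
# hyphens, does not begin or end with a hyphen, and is 1-63 chars long.
LDH_LABEL = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")

def is_ldh_label(label):
    return bool(LDH_LABEL.match(label))

print(is_ldh_label("frosmo"))              # True
print(is_ldh_label("mediaworld_it_api2"))  # False: '_' is not allowed
```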

In [1]: x = "https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22]"

In [2]: import idna

In [3]: idna.encode(x)
---------------------------------------------------------------------------
IDNAError                                 Traceback (most recent call last)
<ipython-input-3-c97070e17b57> in <module>()
----> 1 idna.encode(x)

/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/idna/core.pyc in encode(s, strict, uts46, std3_rules, transitional)
    353         trailing_dot = True
    354     for label in labels:
--> 355         result.append(alabel(label))
    356     if trailing_dot:
    357         result.append(b'')

/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/idna/core.pyc in alabel(label)
    263             ulabel(label)
    264         except IDNAError:
--> 265             raise IDNAError('The label {0} is not a valid A-label'.format(label))
    266         if not valid_label_length(label):
    267             raise IDNAError('Label too long')

IDNAError: The label https://mediaworld_it_api2 is not a valid A-label


How should Scrapy handle this URL? Should we download it regardless of validity?
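One data point for that question: the strict validation comes from the third-party idna package (an IDNA 2008 implementation, which Twisted uses for TLS hostname handling), while Python's built-in "idna" codec follows the older IDNA 2003 ToASCII algorithm and passes any all-ASCII label through with only a length check. A stdlib-only sketch of the difference, using the hostname from the report:

```python
hostname = "mediaworld_it_api2.frosmo.com"

# The stdlib "idna" codec only length-checks all-ASCII labels, so the
# underscore hostname that the strict idna package rejects encodes
# cleanly here.
encoded = hostname.encode("idna")
print(encoded)  # b'mediaworld_it_api2.frosmo.com'
```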

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

3 reactions
Brysonweixin commented, Nov 6, 2019

@pawelmhm Great, thanks for your answer!

Actually I had no choice but to fix this bug by modifying the following code, which you can find at python2.7/site-packages/idna/core.py:

def check_label(label):

    if isinstance(label, (bytes, bytearray)):
        label = label.decode('utf-8')
    if len(label) == 0:
        raise IDNAError('Empty Label')

    check_nfc(label)
    check_hyphen_ok(label)
    check_initial_combiner(label)

    for (pos, cp) in enumerate(label):
        cp_value = ord(cp)
        if intranges_contain(cp_value, idnadata.codepoint_classes['PVALID']):
            continue
        elif intranges_contain(cp_value, idnadata.codepoint_classes['CONTEXTJ']):
            try:
                if not valid_contextj(label, pos):
                    raise InvalidCodepointContext('Joiner {0} not allowed at position {1} in {2}'.format(
                        _unot(cp_value), pos+1, repr(label)))
            except ValueError:
                raise IDNAError('Unknown codepoint adjacent to joiner {0} at position {1} in {2}'.format(
                    _unot(cp_value), pos+1, repr(label)))
        elif intranges_contain(cp_value, idnadata.codepoint_classes['CONTEXTO']):
            if not valid_contexto(label, pos):
                raise InvalidCodepointContext('Codepoint {0} not allowed at position {1} in {2}'.format(_unot(cp_value), pos+1, repr(label)))
        else:
            raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))

    check_bidi(label)

I just replaced

raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))

with a warning (note that idna/core.py does not itself import logging, so an "import logging" line must also be added at the top of the file):

logging.warning('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
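Editing core.py inside site-packages works, but the change is lost on every idna upgrade. Below is a sketch of the same idea applied at runtime instead, for example from a Scrapy project's settings module or an extension. install_lenient_idna is a hypothetical helper, not part of idna or Scrapy, and it assumes that falling back to a plain ASCII encoding is acceptable for the hostnames being crawled:

```python
import logging

try:
    import idna  # third-party strict IDNA 2008 implementation
except ImportError:  # keep the sketch importable without the package
    idna = None

def install_lenient_idna():
    """Monkeypatch idna.encode so invalid labels are logged rather than
    fatal -- the same effect as editing core.py, without touching
    site-packages. Hypothetical helper, not part of idna."""
    if idna is None:
        return
    original = idna.encode

    def lenient_encode(s, *args, **kwargs):
        try:
            return original(s, *args, **kwargs)
        except idna.IDNAError as exc:
            # Mimic browser leniency: warn and pass the name through.
            logging.warning("IDNA validation failed (%s); passing through", exc)
            return s.encode("ascii") if isinstance(s, str) else bytes(s)

    idna.encode = lenient_encode
```

After install_lenient_idna() runs, idna.encode("mediaworld_it_api2.frosmo.com") returns the hostname as bytes instead of raising IDNAError, while valid names are still encoded by the original implementation.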

Bests,

3 reactions
kmike commented, Jul 6, 2018

I think yes, Scrapy should download it regardless of hostname validity. Our gold standard is a browser - if common browsers can download something, Scrapy should be able to do it as well.

Read more comments on GitHub >

Top Results From Across the Web

  • Invalid Hostname error in scrapy - python - Stack Overflow
    I get an error: twisted.python.failure.Failure exceptions.ValueError: invalid hostname: r2---sn-ug5onuxaxjvh-n8vs.c.pack.google.com.
  • Scrapy Documentation (master PDF)
    Scrapy (/ˈskreɪpaɪ/) is an application framework for crawling web sites and extracting structured data which can be used.
  • Scrapy - Shell - GeeksforGeeks
    In the example below, we have a valid URL, and an invalid one. Depending on the nature of the request, the fetch displays...
  • Use Scrapy to Extract Data From HTML Tags - Linode
    By default Scrapy parses only successful HTTP requests; all errors are excluded from parsing. To collect the broken links, the 404 responses...
  • Web Scraping with Python: Everything you need to know (2022)
    From Requests to BeautifulSoup, Scrapy, Selenium and more. ... In our case GET, indicating that we would like to fetch data.
