Scrapy fails to fetch request with invalid hostname

I have a URL with an invalid hostname: it does not conform to the IDNA standard. Scrapy fails when fetching it.

scrapy fetch "https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22]"

2018-07-06 11:53:09 [scrapy.core.scraper] ERROR: Error downloading <GET https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22]>
Traceback (most recent call last):
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/pawel/scrapy/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/home/pawel//scrapy/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/home/pawel/scrapy/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
    return handler.download_request(request, spider)
  File "/home/pawel/scrapy/scrapy/core/downloader/handlers/http11.py", line 67, in download_request
    return agent.download_request(request)
  File "/home/pawelscrapy/scrapy/core/downloader/handlers/http11.py", line 331, in download_request
    method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/web/client.py", line 1649, in request
    endpoint = self._getEndpoint(parsedURI)
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/web/client.py", line 1633, in _getEndpoint
    return self._endpointFactory.endpointForURI(uri)
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/web/client.py", line 1510, in endpointForURI
    uri.port)
  File "/home/pawel/scrapy/scrapy/core/downloader/contextfactory.py", line 59, in creatorForNetloc
    return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext())
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1152, in __init__
    self._hostnameBytes = _idnaBytes(hostname)
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/internet/_idna.py", line 30, in _idnaBytes
    return idna.encode(text)
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/idna/core.py", line 355, in encode
    result.append(alabel(label))
  File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/idna/core.py", line 265, in alabel
    raise IDNAError('The label {0} is not a valid A-label'.format(label))
IDNAError: The label mediaworld_it_api2 is not a valid A-label

The IDNA error is legitimate: the hostname in https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22] is not valid according to the IDNA standard.

In [1]: x = "https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22]"

In [2]: import idna

In [3]: idna.encode(x)
---------------------------------------------------------------------------
IDNAError                                 Traceback (most recent call last)
<ipython-input-3-c97070e17b57> in <module>()
----> 1 idna.encode(x)

/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/idna/core.pyc in encode(s, strict, uts46, std3_rules, transitional)
    353         trailing_dot = True
    354     for label in labels:
--> 355         result.append(alabel(label))
    356     if trailing_dot:
    357         result.append(b'')

/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/idna/core.pyc in alabel(label)
    263             ulabel(label)
    264         except IDNAError:
--> 265             raise IDNAError('The label {0} is not a valid A-label'.format(label))
    266         if not valid_label_length(label):
    267             raise IDNAError('Label too long')

IDNAError: The label https://mediaworld_it_api2 is not a valid A-label
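
One detail in the session above: idna.encode receives the whole URL, so the scheme ends up inside the reported label. Below is a minimal sketch, using only the standard library, that isolates the hostname first and flags labels outside the letters-digits-hyphen rule. The helper name invalid_labels and the regular expression are illustrative assumptions; they simplify the check and are not what the idna package actually runs.

```python
import re
from urllib.parse import urlsplit

# One ASCII hostname label: letters, digits, hyphen; no leading or trailing hyphen.
# Simplified assumption for illustration, not the idna package's own rules.
_LDH_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def invalid_labels(url):
    """Return the hostname labels of `url` that break the simplified LDH rule."""
    host = urlsplit(url).hostname or ""  # .hostname is already lowercased
    return [label for label in host.split(".") if not _LDH_LABEL.fullmatch(label)]


url = "https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22]"
print(invalid_labels(url))  # -> ['mediaworld_it_api2']
```

The underscores are what make the first label fail; frosmo and com pass.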

How should Scrapy handle this URL? Should we download it regardless of validity?

Top GitHub Comments

Brysonweixin commented, Nov 6, 2019

@pawelmhm Many thanks for your answer!

Actually, I had no choice but to work around this bug by modifying the following code. You can find it at this path: python2.7/site-packages/idna/core.py

def check_label(label):

    if isinstance(label, (bytes, bytearray)):
        label = label.decode('utf-8')
    if len(label) == 0:
        raise IDNAError('Empty Label')

    check_nfc(label)
    check_hyphen_ok(label)
    check_initial_combiner(label)

    for (pos, cp) in enumerate(label):
        cp_value = ord(cp)
        if intranges_contain(cp_value, idnadata.codepoint_classes['PVALID']):
            continue
        elif intranges_contain(cp_value, idnadata.codepoint_classes['CONTEXTJ']):
            try:
                if not valid_contextj(label, pos):
                    raise InvalidCodepointContext('Joiner {0} not allowed at position {1} in {2}'.format(
                        _unot(cp_value), pos+1, repr(label)))
            except ValueError:
                raise IDNAError('Unknown codepoint adjacent to joiner {0} at position {1} in {2}'.format(
                    _unot(cp_value), pos+1, repr(label)))
        elif intranges_contain(cp_value, idnadata.codepoint_classes['CONTEXTO']):
            if not valid_contexto(label, pos):
                raise InvalidCodepointContext('Codepoint {0} not allowed at position {1} in {2}'.format(_unot(cp_value), pos+1, repr(label)))
        else:
            raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))

    check_bidi(label)

I just replaced

raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))

with

logging.warning('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))

Best,
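
The change described in this comment swaps a raise for a logged warning, so an invalid code point no longer aborts the request. Here is a self-contained sketch of that pattern, assuming a much smaller allowed set than the real PVALID table. The function check_label, its strict flag, the ALLOWED set, and the local InvalidCodepoint class are stand-ins written for illustration; they are not the idna package's code.

```python
import logging
import string

logger = logging.getLogger("lenient_idna_sketch")

# Simplified stand-in for the PVALID table: ASCII lowercase letters, digits, hyphen.
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


class InvalidCodepoint(ValueError):
    """Local stand-in for the exception of the same name in the idna package."""


def check_label(label, strict=True):
    """Raise on the first disallowed code point, or only warn when strict is False.

    Returns the list of disallowed characters found (always empty in strict mode,
    because strict mode raises instead).
    """
    rejected = []
    for pos, cp in enumerate(label):
        if cp in ALLOWED:
            continue
        message = "Codepoint U+{0:04X} at position {1} of {2!r} not allowed".format(
            ord(cp), pos + 1, label)
        if strict:
            raise InvalidCodepoint(message)
        logger.warning(message)  # lenient mode: report and keep going
        rejected.append(cp)
    return rejected
```

Calling check_label("mediaworld_it_api2", strict=False) logs two warnings and returns the two underscores, while the default strict call raises. Keeping the lenient behaviour behind a flag avoids editing files under site-packages, where the change would be lost on the next reinstall.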

kmike commented, Jul 6, 2018

I think yes, Scrapy should download it regardless of hostname validity. Our gold standard is a browser: if common browsers can download something, Scrapy should be able to do it as well.