Scrapy fails to fetch request with invalid hostname
I have a URL with an invalid hostname: it does not conform to the IDNA standard. Scrapy fails on it.
scrapy fetch "https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22]"
2018-07-06 11:53:09 [scrapy.core.scraper] ERROR: Error downloading <GET https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22]>
Traceback (most recent call last):
File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/pawel/scrapy/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "/home/pawel//scrapy/scrapy/utils/defer.py", line 45, in mustbe_deferred
result = f(*args, **kw)
File "/home/pawel/scrapy/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
return handler.download_request(request, spider)
File "/home/pawel/scrapy/scrapy/core/downloader/handlers/http11.py", line 67, in download_request
return agent.download_request(request)
File "/home/pawelscrapy/scrapy/core/downloader/handlers/http11.py", line 331, in download_request
method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/web/client.py", line 1649, in request
endpoint = self._getEndpoint(parsedURI)
File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/web/client.py", line 1633, in _getEndpoint
return self._endpointFactory.endpointForURI(uri)
File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/web/client.py", line 1510, in endpointForURI
uri.port)
File "/home/pawel/scrapy/scrapy/core/downloader/contextfactory.py", line 59, in creatorForNetloc
return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext())
File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1152, in __init__
self._hostnameBytes = _idnaBytes(hostname)
File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/twisted/internet/_idna.py", line 30, in _idnaBytes
return idna.encode(text)
File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/idna/core.py", line 355, in encode
result.append(alabel(label))
File "/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/idna/core.py", line 265, in alabel
raise IDNAError('The label {0} is not a valid A-label'.format(label))
IDNAError: The label mediaworld_it_api2 is not a valid A-label
The IDNA error is legitimate. The hostname in this URL, https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22], is not valid according to the IDNA standard.
In [1]: x = "https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22]"
In [2]: import idna
In [3]: idna.encode(x)
---------------------------------------------------------------------------
IDNAError Traceback (most recent call last)
<ipython-input-3-c97070e17b57> in <module>()
----> 1 idna.encode(x)
/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/idna/core.pyc in encode(s, strict, uts46, std3_rules, transitional)
353 trailing_dot = True
354 for label in labels:
--> 355 result.append(alabel(label))
356 if trailing_dot:
357 result.append(b'')
/home/pawel/.virtualenvs/scrapy/local/lib/python2.7/site-packages/idna/core.pyc in alabel(label)
263 ulabel(label)
264 except IDNAError:
--> 265 raise IDNAError('The label {0} is not a valid A-label'.format(label))
266 if not valid_label_length(label):
267 raise IDNAError('Label too long')
IDNAError: The label https://mediaworld_it_api2 is not a valid A-label
How should Scrapy handle this URL? Should we download it regardless of validity?
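A note on the IPython session above: it passes the whole URL to idna.encode, which is why the reported label is "https://mediaworld_it_api2" rather than the bare hostname label seen in the Scrapy traceback. The sketch below (Python 3, standard library only) extracts the hostname first and then flags labels that break the classic letters-digits-hyphen rule. The helper name invalid_labels and the LDH_LABEL pattern are made up for illustration; the pattern is only an approximation and is not the check that the idna package actually performs.

```python
import re
from urllib.parse import urlsplit

# Letters, digits and hyphens only, 1-63 characters, no leading or trailing hyphen.
# This is the classic "LDH" hostname rule, used here as an approximation;
# it is NOT the idna package's validation logic.
LDH_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def invalid_labels(url):
    """Return the hostname labels of `url` that break the LDH rule (hypothetical helper)."""
    hostname = urlsplit(url).hostname or ""
    return [label for label in hostname.split(".") if not LDH_LABEL.fullmatch(label)]


url = "https://mediaworld_it_api2.frosmo.com/?method=products&products=[%22747190%22]"
print(urlsplit(url).hostname)  # mediaworld_it_api2.frosmo.com
print(invalid_labels(url))     # ['mediaworld_it_api2']
```

Under this approximation the underscores are the only problem: the other two labels pass, and the scheme and query string never reach the check.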
Issue Analytics
- State:
- Created 5 years ago
- Reactions:2
- Comments:10 (6 by maintainers)
Top GitHub Comments
@pawelmhm Many thanks for your answer!
Actually, I had no choice but to fix this bug by modifying the following code. You can find it at this path: python2.7/site-packages/idna/core.py
I just replaced
with
Bests,
I think yes, Scrapy should download it regardless of hostname validity. Our gold standard is a browser - if common browsers can download something, Scrapy should be able to do it as well.
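One way to picture "download it regardless" is a lenient fallback: try the strict encoder first, and if it rejects the name, send the ASCII bytes unchanged. The following is a minimal sketch under stated assumptions, not Scrapy's or Twisted's actual fix. The names hostname_to_bytes and strict_stub are hypothetical, and strict_stub merely stands in for a strict encoder such as idna.encode so that the example runs with the standard library alone. It also assumes the strict encoder signals failure with a UnicodeError subclass.

```python
def hostname_to_bytes(hostname, strict_encode):
    """Encode `hostname`, falling back to plain ASCII when strict encoding fails.

    `strict_encode` is any callable that raises a UnicodeError subclass on
    names it considers invalid (assumption: a strict IDNA encoder does this).
    """
    try:
        return strict_encode(hostname)
    except UnicodeError:
        # Lenient path: pass ASCII names through untouched, as a browser would.
        # A non-ASCII name that also fails strict encoding still raises here.
        return hostname.encode("ascii")


def strict_stub(hostname):
    """Hypothetical stand-in for a strict encoder: rejects underscores."""
    if "_" in hostname:
        raise UnicodeError("label is not valid: %s" % hostname)
    return hostname.encode("ascii")


print(hostname_to_bytes("mediaworld_it_api2.frosmo.com", strict_stub))
# b'mediaworld_it_api2.frosmo.com'
```

The trade-off is that the fallback hides genuinely malformed names from the caller, so a real implementation would probably want to log the rejected hostname.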