Scrapy finding lots of invalid links - could we have functionality to filter them out?
Using this selector:
links = response.xpath('//a[@href]')
I picked up a lot of whitespace-only hrefs, as well as random schemes like mailto:, tel:, and javascript:, which makes me presume there might be a better function. If link extractors are merely wrappers around the above, though, it might be worth adding some kind of option to sanitize link URLs. (Perhaps that functionality is buried somewhere in the source code or in a section of the docs I skimmed over?)
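In the meantime, here is a rough sketch of the kind of filtering I have in mind; the helper name and the allowed-scheme set are my own choices, not anything Scrapy ships with:

from urllib.parse import urlsplit

ALLOWED_SCHEMES = {'http', 'https'}

def extract_http_links(response):
    """Yield absolute URLs for every <a href=...>, skipping mailto:, tel:, javascript:, etc."""
    for href in response.xpath('//a/@href').extract():
        href = href.strip()
        if not href:
            continue  # whitespace-only href
        url = response.urljoin(href)
        if urlsplit(url).scheme in ALLOWED_SCHEMES:
            yield url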
Anyway, and more importantly, I found the following nuggets in my data's href fields, and methinks Scrapy should definitely provide some kind of built-in filtering mechanism for those (a crude interim check is sketched after the examples):
Improperly closed tag (href is missing its closing bracket; likely an lxml bug):
https://so.dajie.com/job/search?positionFunction=120610&positionName=视频算法>
视频算法
</a>
</li>
<li>
<a id=
---
Same issue, this time with a curly ("pretty") quote:
https://asg.to/contentsPage.html?mcd=dGQRAGplZCkLCfbw”>サンプル動画はコチラ</a><br />
</dd>
<dt>2013.7.16</dt>
<dd><h3>プレミアム会員限定オリジナル動画新作配信!</h3>
アゲサゲのプレミアムメンバーに向けて完全オリジナル動画が更新されました!<br />今回はフリーターのりおちゃん!スレンダーな感じやすい体がそそります!
<a href=
---
PHP error:
<br />
<b>Warning</b>: Use of undefined constant url - assumed 'url' (this will throw an Error in a future version of PHP) in <b>/home/forge/jaidefinichon.com/public/wp-content/themes/jaidefinichon/header.php</b> on line <b>116</b><br />
https://jaidefinichon.com
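Until something like that exists, a crude guard (purely a heuristic of mine, not anything Scrapy provides) would be to reject hrefs that still contain markup or quote characters:

import re

# Markup and quote characters that should never survive into a clean href value;
# their presence usually means the attribute was not terminated properly.
SUSPICIOUS_CHARS = re.compile(r'[<>"\u201c\u201d]')

def looks_like_valid_href(href):
    """Return False for hrefs polluted with stray tags or quotes."""
    return not SUSPICIOUS_CHARS.search(href)

All three examples above trip this check: the first via the trailing >, the second via the curly quote and the <a href= fragment, the third via the <br /> and <b> tags.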
Also, I've been finding URLs with carriage returns, line feeds, tabs, and spaces at random locations. Sometimes they'd be valid had they been URL-encoded; other times, it's basically a miracle that browsers accept the links as is. I've seen the latter characters before and after each of ., ?, =, &, and #.
It might be an upstream problem in urlparse(). For instance:
>>> url = '''https://www.booking.com/country
... .html?label=gen173nr-1FCAEoggI46AdIM1gEaGeIAQGYATG4AQfIAQ3YAQHoAQH4AQKIAgGoAgO4Av2s9OgFwAIB;sid=514ad6c12110c2103c6a2618a429af6d'''
>>> url
'https://www.booking.com/country\n.html?label=gen173nr-1FCAEoggI46AdIM1gEaGeIAQGYATG4AQfIAQ3YAQHoAQH4AQKIAgGoAgO4Av2s9OgFwAIB;sid=514ad6c12110c2103c6a2618a429af6d'
>>> from urllib.parse import urlparse
>>> urlparse(url)
ParseResult(scheme='https', netloc='www.booking.com', path='/country\n.html', params='', query='label=gen173nr-1FCAEoggI46AdIM1gEaGeIAQGYATG4AQfIAQ3YAQHoAQH4AQKIAgGoAgO4Av2s9OgFwAIB;sid=514ad6c12110c2103c6a2618a429af6d', fragment='')
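As an interim fix on the spider side, embedded tabs and newlines could be stripped before the URL ever reaches urlparse() or the Request constructor; as far as I can tell this mirrors what browsers do before parsing (per the WHATWG URL spec). Again, just a sketch, not an existing Scrapy feature:

import re

ASCII_TAB_OR_NEWLINE = re.compile(r'[\t\r\n]')

def clean_url(url):
    """Drop embedded tabs/CRs/LFs and trim surrounding spaces before parsing."""
    return ASCII_TAB_OR_NEWLINE.sub('', url).strip()

With the booking.com URL from the snippet above, clean_url() returns a single-line URL whose path is '/country.html' instead of '/country\n.html'. Spaces in the middle of a URL are trickier: they arguably need percent-encoding rather than removal, which is one more reason to handle this centrally (see the comment below).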
Top GitHub Comments
@kmike please take a look at the above PR. It seems to me like safe_url_string is indeed the place to handle this, since it's the first operation applied to the URL in the Request constructor (https://github.com/scrapy/scrapy/blob/1.6/scrapy/http/request/__init__.py#L54-L58). scurl is not (yet?) in use within Scrapy, right? I can draft a PR for safe_url_string.
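Roughly, the idea (sketched below, not the actual patch) would be to strip the offending whitespace before the escaping that safe_url_string already performs:

import re
from w3lib.url import safe_url_string

ASCII_TAB_OR_NEWLINE = re.compile(r'[\t\r\n]')

def sanitize_url(url, encoding='utf-8'):
    """Remove tabs/newlines and trim spaces, then apply the usual escaping."""
    return safe_url_string(ASCII_TAB_OR_NEWLINE.sub('', url).strip(), encoding)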