Scrapy finding lots of invalid links - could we have functionality to filter them out?
Using this selector:
links = response.xpath('//a[@href]')
I picked up a lot of whitespace-only hrefs, as well as random schemes like mailto:, tel:, and javascript:, which makes me presume there might be a better function. If link extractors are merely wrappers around the above, though, it might be worth adding some kind of option to sanitize link URLs. (Perhaps that functionality is buried somewhere in the source code or in a section of the docs I skimmed over?)
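In the meantime, here is a rough sketch of the kind of filtering I have in mind; the helper name and the allowed-scheme set are my own choices, not anything Scrapy ships with:

from urllib.parse import urlsplit

ALLOWED_SCHEMES = {'http', 'https'}

def extract_http_links(response):
    """Yield absolute URLs for every <a href=...>, skipping mailto:, tel:, javascript:, etc."""
    for href in response.xpath('//a/@href').extract():
        href = href.strip()
        if not href:
            continue  # whitespace-only href
        url = response.urljoin(href)
        if urlsplit(url).scheme in ALLOWED_SCHEMES:
            yield url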
Anyway, and more importantly, I found the following nuggets in my data's href fields, and methinks Scrapy should definitely provide some kind of built-in filtering mechanism for those (a crude interim check is sketched after the examples):
Improperly closed tag (href is missing its closing bracket; likely an lxml bug):
https://so.dajie.com/job/search?positionFunction=120610&positionName=视频算法>
视频算法
</a>
</li>
<li>
<a id=
---
Same issue, this time with a curly ("pretty") quote:
https://asg.to/contentsPage.html?mcd=dGQRAGplZCkLCfbw”>サンプル動画はコチラ</a><br />
</dd>
<dt>2013.7.16</dt>
<dd><h3>プレミアム会員限定オリジナル動画新作配信!</h3>
アゲサゲのプレミアムメンバーに向けて完全オリジナル動画が更新されました!<br />今回はフリーターのりおちゃん!スレンダーな感じやすい体がそそります!
<a href=
---
PHP error:
<br />
<b>Warning</b>: Use of undefined constant url - assumed 'url' (this will throw an Error in a future version of PHP) in <b>/home/forge/jaidefinichon.com/public/wp-content/themes/jaidefinichon/header.php</b> on line <b>116</b><br />
https://jaidefinichon.com
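Until something like that exists, a crude guard (purely a heuristic of mine, not anything Scrapy provides) would be to reject hrefs that still contain markup or quote characters:

import re

# Markup and quote characters that should never survive into a clean href value;
# their presence usually means the attribute was not terminated properly.
SUSPICIOUS_CHARS = re.compile(r'[<>"\u201c\u201d]')

def looks_like_valid_href(href):
    """Return False for hrefs polluted with stray tags or quotes."""
    return not SUSPICIOUS_CHARS.search(href)

All three examples above trip this check: the first via the trailing >, the second via the curly quote and the <a href= fragment, the third via the <br /> and <b> tags.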
Also, I've been finding URLs with carriage returns, line feeds, tabs, and spaces at random locations. Sometimes they'd be valid had they been URL-encoded; other times, it's basically a miracle that browsers accept the links as is. I've seen the latter characters before and after each of ., ?, =, &, and #.
It might be an upstream problem in urlparse(). For instance:
>>> url = '''https://www.booking.com/country
... .html?label=gen173nr-1FCAEoggI46AdIM1gEaGeIAQGYATG4AQfIAQ3YAQHoAQH4AQKIAgGoAgO4Av2s9OgFwAIB;sid=514ad6c12110c2103c6a2618a429af6d'''
>>> url
'https://www.booking.com/country\n.html?label=gen173nr-1FCAEoggI46AdIM1gEaGeIAQGYATG4AQfIAQ3YAQHoAQH4AQKIAgGoAgO4Av2s9OgFwAIB;sid=514ad6c12110c2103c6a2618a429af6d'
>>> from urllib.parse import urlparse
>>> urlparse(url)
ParseResult(scheme='https', netloc='www.booking.com', path='/country\n.html', params='', query='label=gen173nr-1FCAEoggI46AdIM1gEaGeIAQGYATG4AQfIAQ3YAQHoAQH4AQKIAgGoAgO4Av2s9OgFwAIB;sid=514ad6c12110c2103c6a2618a429af6d', fragment='')
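As an interim fix on the spider side, embedded tabs and newlines could be stripped before the URL ever reaches urlparse() or the Request constructor; as far as I can tell this mirrors what browsers do before parsing (per the WHATWG URL spec). Again, just a sketch, not an existing Scrapy feature:

import re

ASCII_TAB_OR_NEWLINE = re.compile(r'[\t\r\n]')

def clean_url(url):
    """Drop embedded tabs/CRs/LFs and trim surrounding spaces before parsing."""
    return ASCII_TAB_OR_NEWLINE.sub('', url).strip()

With the booking.com URL from the snippet above, clean_url() returns a single-line URL whose path is '/country.html' instead of '/country\n.html'. Spaces in the middle of a URL are trickier: they arguably need percent-encoding rather than removal, which is one more reason to handle this centrally (see the comment below).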
Top GitHub Comments
@kmike please take a look at the above PR. It seems to me like safe_url_string is indeed the place to handle this, since it's the first operation applied to the URL in the Request constructor (https://github.com/scrapy/scrapy/blob/1.6/scrapy/http/request/__init__.py#L54-L58). scurl is not (yet?) in use within Scrapy, right? I can draft a PR for safe_url_string.
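Roughly, the idea (sketched below, not the actual patch) would be to strip the offending whitespace before the escaping that safe_url_string already performs:

import re
from w3lib.url import safe_url_string

ASCII_TAB_OR_NEWLINE = re.compile(r'[\t\r\n]')

def sanitize_url(url, encoding='utf-8'):
    """Remove tabs/newlines and trim spaces, then apply the usual escaping."""
    return safe_url_string(ASCII_TAB_OR_NEWLINE.sub('', url).strip(), encoding)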