Scrapy finding lots of invalid links - could we have functionality to filter them out?

See original GitHub issue

Using this selector:

links = response.xpath('//a[@href]')

I picked up a lot of URLs containing spaces, as well as random schemes like mailto:, tel:, and javascript:, which makes me presume there might be a better function. If link extractors are merely wrappers around the above, though, it might be worth adding some kind of option to sanitize link URLs. (Perhaps that functionality is buried somewhere in the source code or in a section of the docs I skimmed over?)
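For comparison, here is a sketch of the same extraction done through the built-in LinkExtractor (the spider name and start URL below are made up); as far as I can tell it already drops schemes other than http/https/ftp/file, such as mailto:, tel: and javascript:, and strips surrounding whitespace from each href:

import scrapy
from scrapy.linkextractors import LinkExtractor

class LinkSpider(scrapy.Spider):
    # Hypothetical spider, just to make the sketch self-contained.
    name = "links"
    start_urls = ["https://example.com/"]

    link_extractor = LinkExtractor()  # strip=True is the default

    def parse(self, response):
        # extract_links() ignores non-http(s)/ftp/file schemes and trims
        # whitespace around the attribute value before building each Link,
        # so mailto:/tel:/javascript: hrefs never show up here.
        for link in self.link_extractor.extract_links(response):
            yield {"url": link.url, "text": link.text}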

Anyway, more important than the above: I found the following nuggets in my data’s href fields, and methinks Scrapy should definitely provide some kind of built-in filtering mechanism for those:

Improperly closed tag (href is missing its closing bracket; likely an lxml bug):

https://so.dajie.com/job/search?positionFunction=120610&positionName=视频算法>
                                        视频算法
                                    </a>
                                </li>
                                <li>
                                    <a id=

---
Same issue, this time caused by a curly quote:

https://asg.to/contentsPage.html?mcd=dGQRAGplZCkLCfbw”>サンプル動画はコチラ</a><br />
</dd>
	<dt>2013.7.16</dt>
	<dd><h3>プレミアム会員限定オリジナル動画新作配信!</h3>
	アゲサゲのプレミアムメンバーに向けて完全オリジナル動画が更新されました!<br />今回はフリーターのりおちゃん!スレンダーな感じやすい体がそそります!
	<a href=

---
PHP error:

<br />
<b>Warning</b>:  Use of undefined constant url - assumed 'url' (this will throw an Error in a future version of PHP) in <b>/home/forge/jaidefinichon.com/public/wp-content/themes/jaidefinichon/header.php</b> on line <b>116</b><br />
https://jaidefinichon.com

Also, I’ve been finding URLs with carriage returns, line feeds, tabs, and spaces at random locations. Sometimes they’d be valid had they been URL-encoded. Other times, it’s basically a miracle that browsers accept the links as-is. I’ve seen the latter characters before and after each of ., ?, =, &, and #.

It might be an upstream problem in urlparse(). For instance:

>>> url = '''https://www.booking.com/country
... .html?label=gen173nr-1FCAEoggI46AdIM1gEaGeIAQGYATG4AQfIAQ3YAQHoAQH4AQKIAgGoAgO4Av2s9OgFwAIB;sid=514ad6c12110c2103c6a2618a429af6d'''
>>> url
'https://www.booking.com/country\n.html?label=gen173nr-1FCAEoggI46AdIM1gEaGeIAQGYATG4AQfIAQ3YAQHoAQH4AQKIAgGoAgO4Av2s9OgFwAIB;sid=514ad6c12110c2103c6a2618a429af6d'
>>> from urllib.parse import urlparse
>>> urlparse(url)
ParseResult(scheme='https', netloc='www.booking.com', path='/country\n.html', params='', query='label=gen173nr-1FCAEoggI46AdIM1gEaGeIAQGYATG4AQfIAQ3YAQHoAQH4AQKIAgGoAgO4Av2s9OgFwAIB;sid=514ad6c12110c2103c6a2618a429af6d', fragment='')

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 11 (9 by maintainers)

Top GitHub Comments

1 reaction
elacuesta commented, Jul 11, 2019

@kmike please take a look at the above PR. It seems to me like safe_url_string is indeed the place to handle this, since it’s the first operation applied to the URL in the Request constructor (https://github.com/scrapy/scrapy/blob/1.6/scrapy/http/request/__init__.py#L54-L58)
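For reference, a quick way to check what that first step currently does to one of the problematic URLs (a sketch, not Scrapy code; the exact output depends on the installed w3lib version):

from w3lib.url import safe_url_string

# safe_url_string is the first transformation applied to the URL in
# Request.__init__, so whatever it returns is what the scheduler,
# dupefilter and downloader end up seeing. Depending on the w3lib
# version, the embedded newline may be percent-encoded or left alone
# rather than removed, which is what the proposed change would address.
url = "https://www.booking.com/country\n.html?label=foo"
print(repr(safe_url_string(url, encoding="utf-8")))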

1 reaction
elacuesta commented, Jul 11, 2019

scurl is not (yet?) in use within Scrapy, right? I can draft a PR for safe_url_string.


Top Results From Across the Web

  • Scrapy Filter duplicates extracted URLS from webpage
    OK, so i am using Scrapy. I am currently trying to scrape "snipplr.com/all/page" then extract urls in the page. I then filter the...
  • Requests and Responses — Scrapy 2.7.1 documentation
    There are some aspects of scraping, such as filtering out duplicate requests (see DUPEFILTER_CLASS) or caching responses (see HTTPCACHE_POLICY)...
  • Broken links checker with Python and Scrapy webcrawler
    Python web crawler using Scrapy to check for broken links ... the csv report. so here is where can filter out only what...
  • Advanced Python Web Scraping: Best Practices & Workarounds
    There are multiple sites where you can find a list of free proxies to use (like this). Both requests and scrapy have functionalities...
  • Web Scraping in Python using Scrapy (with multiple examples)
    Note: We have created a free course for web scraping using BeautifulSoup library. You can check it out here: Introduction to Web Scraping...
