UnicodeDecodeError in LxmlLinkExtractor
The following links trigger a UnicodeDecodeError exception when being extracted by LxmlLinkExtractor:
<a href="http://gostariadefazerinscriçãoposcursos,obrigada.">A link</a>
<a href="http://www.domain.org ">Another link</a>
>>> 'http://gostariadefazerinscriçãoposcursos,obrigada.'
'http://gostariadefazerinscri\xc3\xa7\xc3\xa3oposcursos,obrigada.'
>>> 'http://www.domain.org '
'http://www.domain.org\xc2\xa0'
Here is the traceback:
File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "build/bdist.macosx-10.10-x86_64/egg/Spider/spiders/spider.py", line 271, in parse_response
File "build/bdist.macosx-10.10-x86_64/egg/Spider/spiders/spider.py", line 293, in extract_requests_from_response
File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/linkextractors/lxmlhtml.py", line 108, in extract_links
all_links.extend(self._process_links(links))
File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractor.py", line 86, in _process_links
links = [x for x in links if self._link_allowed(x)]
File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractor.py", line 66, in _link_allowed
if self.allow_domains and not url_is_from_any_domain(parsed_url, self.allow_domains):
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/url.py", line 23, in url_is_from_any_domain
return any(((host == d.lower()) or (host.endswith('.%s' % d.lower())) for d in domains))
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/url.py", line 23, in <genexpr>
return any(((host == d.lower()) or (host.endswith('.%s' % d.lower())) for d in domains))
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
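A plausible reading of that last frame (an assumption, not something stated in the issue): under Python 2 the extracted netloc is a UTF-8 byte string while the entries in allow_domains are unicode, and str.endswith(unicode) implicitly decodes the byte string as ASCII, which fails on the 0xc3 byte. A tiny Python 2 illustration, with medicatriz.com.br borrowed from the comments below:

host = 'gostariadefazerinscri\xc3\xa7\xc3\xa3oposcursos,obrigada.'  # byte-string netloc (str)
domain = u'medicatriz.com.br'                                       # unicode allow_domains entry
host == domain.lower()                   # mixed comparison: only a UnicodeWarning, evaluates to False
host.endswith(u'.%s' % domain.lower())   # raises UnicodeDecodeError ('ascii' codec can't decode byte 0xc3)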
Top GitHub Comments
Hi everyone, the page http://medicatriz.com.br/manual-tecnico-profissional-medicatriz/ is no longer available, so I created an HTML file (test.html) to recreate the bug and ran
scrapy shell ./test.html
to try to get the error. The output shows that I don't get any link when using
allow_domains=(b'medicatriz.com.br',)
as in @redapple's comment; without allow_domains I can still get the link, but I cannot reproduce the error with the version I am working with. This issue can probably be closed, since we cannot currently reproduce the reported bug.
The target page still has the problematic links. I can reproduce this with Scrapy 1.1.2 on Python 2.7, but with Python 3 it's a bit different: no links are extracted when the allowed domains are passed as bytes (which may not be a valid use).
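A sketch of that Python 3 difference (assumed, not copied from the thread; the single-link body is made up and medicatriz.com.br is reused from the comments above):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LxmlLinkExtractor

body = b'<a href="http://medicatriz.com.br/manual">A link</a>'
response = HtmlResponse(url='http://medicatriz.com.br/', body=body, encoding='utf-8')

LxmlLinkExtractor(allow_domains=('medicatriz.com.br',)).extract_links(response)   # link is extracted
LxmlLinkExtractor(allow_domains=(b'medicatriz.com.br',)).extract_links(response)  # [] on Python 3: the str netloc never matches a bytes domain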