
UnicodeDecodeError in LxmlLinkExtractor

See original GitHub issue

The following links trigger a UnicodeDecodeError exception when extracted by LxmlLinkExtractor:

<a href="http://gostariadefazerinscriçãoposcursos,obrigada.">A link</a> <a href="http://www.domain.org ">Another link</a>

>>> 'http://gostariadefazerinscriçãoposcursos,obrigada.'
'http://gostariadefazerinscri\xc3\xa7\xc3\xa3oposcursos,obrigada.'
>>> 'http://www.domain.org '
'http://www.domain.org\xc2\xa0'
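Those are Python 2 byte-string reprs: the first URL contains the UTF-8 byte pairs for ç (\xc3\xa7) and ã (\xc3\xa3), and the second ends in a non-breaking space (\xc2\xa0). For illustration, the same exception can be reproduced on any Python version by forcing an ASCII decode of those bytes, which is effectively what Python 2 does implicitly when a byte string meets a unicode string (a standalone sketch, not Scrapy code):

# Standalone sketch: the href as a raw UTF-8 byte string.
host = b'gostariadefazerinscri\xc3\xa7\xc3\xa3oposcursos,obrigada.'

# Python 2 performs the equivalent of this decode implicitly when a
# byte string is compared against unicode in == or endswith:
host.decode('ascii')
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...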

Here is the traceback:

File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
 return (r for r in result or () if _filter(r))

File "build/bdist.macosx-10.10-x86_64/egg/Spider/spiders/spider.py", line 271, in parse_response        

File "build/bdist.macosx-10.10-x86_64/egg/Spider/spiders/spider.py", line 293, in extract_requests_from_response

File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/linkextractors/lxmlhtml.py", line 108, in extract_links
    all_links.extend(self._process_links(links))

File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractor.py", line 86, in _process_links
    links = [x for x in links if self._link_allowed(x)]

File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractor.py", line 66, in _link_allowed
    if self.allow_domains and not url_is_from_any_domain(parsed_url, self.allow_domains):

File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/url.py", line 23, in url_is_from_any_domain
    return any(((host == d.lower()) or (host.endswith('.%s' % d.lower())) for d in domains))

File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/url.py", line 23, in <genexpr>
    return any(((host == d.lower()) or (host.endswith('.%s' % d.lower())) for d in domains))

exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
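What happens here: on Python 2, lxml returns the href as a UTF-8-encoded byte string, while the allow_domains entries are unicode; host.endswith('.%s' % d.lower()) then forces an implicit ascii decode of the byte-string host, which fails on the first non-ASCII byte. One defensive approach (a sketch only, not Scrapy's actual fix; to_text and host_is_from_any_domain are hypothetical names) is to normalise both sides to text before comparing:

def to_text(value, encoding='utf-8'):
    # Normalise bytes to text so comparisons never trigger an
    # implicit ascii decode.
    return value.decode(encoding) if isinstance(value, bytes) else value

def host_is_from_any_domain(host, domains):
    # Mirrors the check in scrapy.utils.url.url_is_from_any_domain,
    # but takes an already-extracted host and normalises types first.
    host = to_text(host).lower()
    for domain in domains:
        d = to_text(domain).lower()
        if host == d or host.endswith('.' + d):
            return True
    return False

With this, host_is_from_any_domain(b'www.medicatriz.com.br', [u'medicatriz.com.br']) returns True instead of raising.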

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
Jgaldos commented, Aug 28, 2020

Hi everyone, the page http://medicatriz.com.br/manual-tecnico-profissional-medicatriz/ is no longer available, so I created an HTML file to reproduce the bug:

<html>
    <head>
        <title>Runtime Error</title>
    </head>

    <body bgcolor="white">
        <div class="comment-body parent-cmnt">
            <span class="author"><a href='http://gostariadefazerinscriçãoposcursos,obrigada.' rel='external nofollow' class='url'>liliane</a></span>
            <br />
            <span class="date">29 maio 2014</span>
            <p>Comentário</p>
        </div>
    </body>
</html>

and ran scrapy shell ./test.html to try to trigger the error, but I got

2020-08-21 14:45:40 [scrapy.core.engine] INFO: Spider opened
2020-08-21 14:45:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///Users/pp/Documents/Bitmaker/scrapy/test.html> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x1109bb2e8>
[s]   item       {}
[s]   request    <GET file:///Users/pp/Documents/Bitmaker/scrapy/test.html>
[s]   response   <200 file:///Users/pp/Documents/Bitmaker/scrapy/test.html>
[s]   settings   <scrapy.settings.Settings object at 0x1041ec438>
[s]   spider     <DefaultSpider 'default' at 0x110fc8080>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> from scrapy.linkextractors import LinkExtractor
>>> LinkExtractor(allow_domains=(b'medicatriz.com.br',)).extract_links(response)
[]
>>> LinkExtractor().extract_links(response)
[Link(url='http://xn--gostariadefazerinscrioposcursos,obrigada-uld1o.', text='liliane', fragment='', nofollow=True)]
>>> exit()
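The xn-- form in that Link is the IDNA (punycode) ASCII-compatible encoding of the non-ASCII host, which Scrapy on Python 3 produces when building safe URLs; it also shows why an accented or byte-string allow_domains value can never match the extracted host. Assuming the standard library's punycode codec (the xn-- ACE prefix is added separately, per label), the folding can be observed directly:

# The punycode codec folds ç and ã into an ASCII tail; the 'xn--'
# ACE prefix is prepended separately when forming an IDNA label.
label = 'gostariadefazerinscriçãoposcursos,obrigada'
print(b'xn--' + label.encode('punycode'))
# Compare with the host in the Link printed above.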

You can see that I don't get any links when using allow_domains=(b'medicatriz.com.br',), as in @redapple's comment. Without allow_domains I can still get the link, but I cannot reproduce the error. I am working with:

Scrapy       : 2.2.0
lxml         : 4.5.2.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020)
cryptography : 2.9.2
Platform     : Darwin-17.7.0-x86_64-i386-64bit

This issue can probably be closed, given that we cannot currently reproduce the reported bug.

0 reactions
redapple commented, Sep 14, 2016

The target page still has

            <div class="comment-body parent-cmnt">
                <span class="author"><a href='http://gostariadefazerinscriçãoposcursos,obrigada.' rel='external nofollow' class='url'>liliane</a></span>
                <br />
                <span class="date">29 maio 2014</span>
                <p>Comentário</p>
            </div>

I can reproduce this with scrapy 1.1.2 with Python 2.7:

$ scrapy version -v
Scrapy    : 1.1.2
lxml      : 3.6.4.0
libxml2   : 2.9.4
Twisted   : 16.4.0
Python    : 2.7.11+ (default, Apr 17 2016, 14:00:29) - [GCC 5.3.1 20160413]
pyOpenSSL : 16.1.0 (OpenSSL 1.0.2g-fips  1 Mar 2016)
Platform  : Linux-4.4.0-36-generic-x86_64-with-Ubuntu-16.04-xenial

(scrapy11.py2) $ scrapy shell http://medicatriz.com.br/manual-tecnico-profissional-medicatriz/
2016-09-14 16:21:25 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
(...)
2016-09-14 16:21:25 [scrapy] INFO: Spider opened
2016-09-14 16:21:25 [scrapy] DEBUG: Crawled (200) <GET http://medicatriz.com.br/manual-tecnico-profissional-medicatriz/> (referer: None)
(...)

>>> from scrapy.linkextractors import LinkExtractor

>>> LinkExtractor(allow_domains=('medicatriz.com.br',)).extract_links(response)
[Link(url='http://medicatriz.com.br/', text=u'', fragment='', nofollow=False),
Link(url='http://medicatriz.com.br/quem-somos/', text='A Empresa', fragment='', nofollow=False),
...
Link(url='http://medicatriz.com.br/velox-removedor-de-cuticulas/', text=u'Velox, removedor de cut\xedculas', fragment='', nofollow=False),
Link(url='http://medicatriz.com.br/cell-plus-creme-para-massagem-1kg/', text=u'Cell Plus \u2013 Creme para Massagem', fragment='', nofollow=False)]

>>> LinkExtractor(allow_domains=(u'medicatriz.com.br',)).extract_links(response)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/paul/.virtualenvs/scrapy11.py2/local/lib/python2.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 111, in extract_links
    all_links.extend(self._process_links(links))
  File "/home/paul/.virtualenvs/scrapy11.py2/local/lib/python2.7/site-packages/scrapy/linkextractors/__init__.py", line 100, in _process_links
    links = [x for x in links if self._link_allowed(x)]
  File "/home/paul/.virtualenvs/scrapy11.py2/local/lib/python2.7/site-packages/scrapy/linkextractors/__init__.py", line 80, in _link_allowed
    if self.allow_domains and not url_is_from_any_domain(parsed_url, self.allow_domains):
  File "/home/paul/.virtualenvs/scrapy11.py2/local/lib/python2.7/site-packages/scrapy/utils/url.py", line 30, in url_is_from_any_domain
    return any((host == d) or (host.endswith('.%s' % d)) for d in domains)
  File "/home/paul/.virtualenvs/scrapy11.py2/local/lib/python2.7/site-packages/scrapy/utils/url.py", line 30, in <genexpr>
    return any((host == d) or (host.endswith('.%s' % d)) for d in domains)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>>> 

but with Python 3 it's a bit different: no links are extracted when the allowed domains are passed as bytes (which may not be a valid use):

$ scrapy version -v
Scrapy    : 1.1.2
lxml      : 3.6.4.0
libxml2   : 2.9.4
Twisted   : 16.4.0
Python    : 3.5.1+ (default, Mar 30 2016, 22:46:26) - [GCC 5.3.1 20160330]
pyOpenSSL : 16.1.0 (OpenSSL 1.0.2g-fips  1 Mar 2016)
Platform  : Linux-4.4.0-36-generic-x86_64-with-Ubuntu-16.04-xenial

$ scrapy shell http://medicatriz.com.br/manual-tecnico-profissional-medicatriz/
2016-09-14 16:35:16 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
(...)
2016-09-14 16:35:17 [scrapy] DEBUG: Crawled (200) <GET http://medicatriz.com.br/manual-tecnico-profissional-medicatriz/> (referer: None)
(...)
>>> from scrapy.linkextractors import LinkExtractor
>>> LinkExtractor(allow_domains=('medicatriz.com.br',)).extract_links(response)
[Link(url='http://medicatriz.com.br/', text='', fragment='', nofollow=False),
 Link(url='http://medicatriz.com.br/quem-somos/', text='A Empresa', fragment='', nofollow=False),
 (...)]
>>> LinkExtractor(allow_domains=(b'medicatriz.com.br',)).extract_links(response)
[]
>>> 
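The silent empty result on Python 3 follows from how bytes interpolate into str: '%s' % b'...' yields the bytes repr, quotes and b prefix included, so neither the equality nor the endswith check in url_is_from_any_domain can ever match, and every link is quietly filtered out. A minimal demonstration:

# Why bytes in allow_domains yield [] on Python 3 instead of an error:
host = 'medicatriz.com.br'
domain = b'medicatriz.com.br'

print(host == domain)                  # False: str never equals bytes
print('.%s' % domain)                  # ".b'medicatriz.com.br'"
print(host.endswith('.%s' % domain))   # False, so the link is dropped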