Python 3.5 - Unable to build DOM tree.
File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:79801)
File "src/lxml/parser.pxi", line 1799, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:116219)
File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116494)
File "src/lxml/parser.pxi", line 1700, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115040)
File "src/lxml/parser.pxi", line 1040, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109165)
File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103404)
File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105058)
File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:103967)
File "<string>", line None
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1
With preceding:
encoding error : input conversion failed due to input error, bytes 0x21 0x00 0x00 0x00
encoding error : input conversion failed due to input error, bytes 0x44 0x00 0x00 0x00
I/O error : encoder error
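The byte groups in the encoding errors above can be inspected directly. The sketch below is only an interpretation, not something confirmed in the report: it assumes each 4-byte group is one little-endian UTF-32 code unit and that the page starts with a typical `<!DOCTYPE` prologue (the `prologue` string is a made-up sample, not the actual page content).

```python
# Byte groups copied from the two "encoding error" lines above.
reported = [bytes([0x21, 0x00, 0x00, 0x00]), bytes([0x44, 0x00, 0x00, 0x00])]

# Assumption: each group is a single little-endian UTF-32 code unit.
decoded = [chunk.decode("utf-32-le") for chunk in reported]
print(decoded)

# Hypothetical sample prologue; the real page content is not shown in the issue.
prologue = "<!DOCTYPE html>"
encoded = prologue.encode("utf-32-le")

# The 2nd and 3rd characters of the prologue occupy bytes 4..8 and 8..12.
second_char, third_char = encoded[4:8], encoded[8:12]
print(second_char == reported[0], third_char == reported[1])
```

If that reading is right, the parser was given the text in a 4-byte-per-character form and failed while converting it, which would be consistent with the "switching encoding: encoder error" at line 1, column 1.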
Example:
from grab.spider import Spider, Task


class Scraper(Spider):
    def task_generator(self):
        urls = [
            'https://au.linkedin.com/directory/people-a/',
            'https://www.linkedin.com/directory/people-a/'
        ]
        for url in urls:
            yield Task('url', url=url)

    def task_url(self, grab, task):
        # Querying grab.doc builds the DOM tree, which is where the error is raised
        links = grab.doc('//div[@class="columns"]//ul/li[@class="content"]/a')


bot = Scraper()
bot.run()
This happens on some pages; perhaps lxml failed to detect the correct encoding.
Issue Analytics
- State:
- Created 7 years ago
- Comments:23 (15 by maintainers)
Top GitHub Comments
Solution (assuming you’re using virtualenv):
- install libxml2 and libxslt using brew
- uninstall lxml
- install lxml with statically linked dependencies
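After reinstalling, it may help to check which libxml2 and libxslt versions lxml was compiled against and which ones it loads at runtime. The following is a minimal sketch; `describe_lxml_build` is a hypothetical helper name, and it returns `None` when lxml is not importable in the current environment.

```python
def describe_lxml_build():
    """Return lxml's compiled and runtime library versions, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,
        "libxml2_compiled": etree.LIBXML_COMPILED_VERSION,
        "libxml2_runtime": etree.LIBXML_VERSION,
        "libxslt_compiled": etree.LIBXSLT_COMPILED_VERSION,
        "libxslt_runtime": etree.LIBXSLT_VERSION,
    }


info = describe_lxml_build()
if info is None:
    print("lxml is not installed in this environment")
else:
    # Each version is a tuple of integers; join it into dotted form for display.
    for name, version in info.items():
        print(name, ".".join(str(part) for part in version))
```

A mismatch between the compiled and runtime versions would suggest that lxml is picking up a different system library than the one it was built with, which is the situation a statically linked build is meant to avoid.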
https://github.com/oiwn/grab-reproduce
this code run results:
Additional info:
http://louistiao.me/posts/installing-lxml-on-mac-osx-1011-inside-a-virtualenv-with-pip/
http://lxml.de/build.html#building-lxml-on-macos-x
@rickwargo @Alex-Just
maybe report to upstream (lxml)?