Python 3.5 - Unable to build DOM tree.

File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:79801)
  File "src/lxml/parser.pxi", line 1799, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:116219)
  File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116494)
  File "src/lxml/parser.pxi", line 1700, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115040)
  File "src/lxml/parser.pxi", line 1040, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109165)
  File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103404)
  File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105058)
  File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:103967)
  File "<string>", line None
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1

The exception is preceded by these messages:

encoding error : input conversion failed due to input error, bytes 0x21 0x00 0x00 0x00
encoding error : input conversion failed due to input error, bytes 0x44 0x00 0x00 0x00
I/O error : encoder error
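
A quick way to read the byte groups in those messages is to decode them as 4-byte little-endian code units. This is only a sketch of one possible interpretation, not something confirmed in the issue: the assumption is that the parser was handed text in a 4-bytes-per-character form and the conversion step failed on it. The helper name `decode_utf32le` is made up for illustration.

```python
# The two byte groups quoted in the "encoding error" messages above.
reported = [
    bytes([0x21, 0x00, 0x00, 0x00]),
    bytes([0x44, 0x00, 0x00, 0x00]),
]


def decode_utf32le(chunk: bytes) -> str:
    """Decode one 4-byte group as a little-endian UTF-32 code unit (assumed encoding)."""
    return chunk.decode("utf-32-le")


decoded = [decode_utf32le(chunk) for chunk in reported]
print(decoded)  # ['!', 'D']
```

Both characters occur at the start of a typical `<!DOCTYPE` declaration, which would be consistent with the failure being reported at line 1, column 1.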

Example:

from grab.spider import Spider, Task


class Scraper(Spider):
    def task_generator(self):
        # Two regional variants of the same directory page.
        urls = [
            'https://au.linkedin.com/directory/people-a/',
            'https://www.linkedin.com/directory/people-a/'
        ]
        for url in urls:
            yield Task('url', url=url)

    def task_url(self, grab, task):
        # Running the XPath query makes Grab parse the page with lxml.
        links = grab.doc('//div[@class="columns"]//ul/li[@class="content"]/a')


bot = Scraper()
bot.run()

This happens only on some pages; perhaps lxml fails to detect the correct encoding.

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 23 (15 by maintainers)

Top GitHub Comments

2 reactions
oiwn commented, Dec 8, 2022

Solution (assuming you’re using virtualenv):

Install libxml2 and libxslt using brew.

Uninstall lxml:

pip uninstall lxml

Reinstall lxml with statically linked dependencies:

STATIC_DEPS=true pip install lxml --no-cache-dir
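
After reinstalling, it may help to check which libxml2 the installed lxml actually uses. The sketch below reads the version constants that lxml exposes on `lxml.etree`; the function name `lxml_build_info` is hypothetical, and the snippet returns `None` when lxml is not importable. A mismatch between the compiled and runtime libxml2 versions may suggest that lxml is still picking up a different system library, though that is an assumption rather than a guaranteed diagnosis.

```python
def lxml_build_info():
    """Return lxml/libxml2 version tuples, or None if lxml is not installed."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,
        # libxml2 version loaded at runtime.
        "libxml2_runtime": etree.LIBXML_VERSION,
        # libxml2 version lxml was compiled against.
        "libxml2_compiled": etree.LIBXML_COMPILED_VERSION,
    }


info = lxml_build_info()
print(info)
```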

https://github.com/oiwn/grab-reproduce

Running the code from the repository above produces:

(grab) ➜ oiwn@mylaptop  ~/projects/grab-reproduce git:(master) ✗ python github.py
/Users/oiwn/.virtualenvs/grab/lib/python3.5/site-packages/grab/deprecated.py:250: GrabDeprecationWarning: The `Grab.response` attribute is deprecated. Use `Grab.doc` instead.
  warn('The `Grab.response` attribute is deprecated. '
http://localhost:8000/showcases/virtual-reality
http://localhost:8000/showcases/software-defined-radio
http://localhost:8000/showcases/tools-for-open-source
http://localhost:8000/showcases/open-source-integrations
http://localhost:8000/showcases/serverless-architecture
http://localhost:8000/showcases/emoji
http://localhost:8000/showcases/web-application-frameworks
http://localhost:8000/showcases/hacking-minecraft
http://localhost:8000/showcases/web-accessibility
http://localhost:8000/showcases/github-browser-extensions
http://localhost:8000/showcases/great-for-new-contributors
http://localhost:8000/showcases/productivity-tools
http://localhost:8000/showcases/javascript-game-engines
http://localhost:8000/showcases/projects-that-power-github-for-mac
http://localhost:8000/showcases/game-engines
(grab) ➜ oiwn@mylaptop  ~/projects/grab-reproduce git:(master) ✗ python --version
Python 3.5.2

Additional info:

http://louistiao.me/posts/installing-lxml-on-mac-osx-1011-inside-a-virtualenv-with-pip/
http://lxml.de/build.html#building-lxml-on-macos-x

@rickwargo @Alex-Just

1 reaction
oiwn commented, May 29, 2017

Maybe report this upstream (to lxml)?
