Parser: book content's corrupted or not present: 9781098122836

[#] Parser: book content's corrupted or not present: node13-ch5.html (Chapter 5: Top 5 Developer-friendly Node.js API Frameworks)

However, I can open the page in a browser without any problem:

https://learning.oreilly.com/library/view/nodejs-tools/9781098122836/Text/node13-ch5.html

Issue Analytics

Created 3 years ago
Comments: 10 (1 by maintainers)

Top GitHub Comments

5 reactions

rknuus commented, Feb 4, 2022

A dead ugly workaround is to download the failing file again and then parse it in a slightly different way.

Funnily enough, the object returned by this second parser has the wrong type, Element, and must be converted to an HtmlElement to match the expectations of the code that uses it later on. For this I apply a tostring conversion followed by fromstring, which is certainly not an efficient approach, but my lxml-fu is simply too weak. In my case this code executes rarely enough and is fast enough that I don’t care.

Because the whole thing is so cheesy and I don’t even understand the root cause, I don’t plan to create an MR. So the next best thing is to provide the patch below. To apply it, store the patch in a file and run git apply <patch file> in the safaribooks git repo. If the patch fails to apply, consider checking out version af22b43c1 or a sufficiently compatible version and trying again.

Limitation: Because I use the path /tmp, the hack will only work on *nix-based systems (incl. Macs); I didn’t bother to use StringIO or at least Python’s temporary file module (tempfile).

diff --git a/safaribooks.py b/safaribooks.py
index 1d23bee..461e2ef 100755
--- a/safaribooks.py
+++ b/safaribooks.py
@@ -605,6 +605,16 @@ class SafariBooks:
 
         return root
 
+    def download_html_to_file(self, url, file_name):
+        response = self.requests_provider(url)
+        if response == 0 or response.status_code != 200:
+            self.display.exit(
+                "Crawler: error trying to retrieve this page: %s (%s)\n    From: %s" %
+                (self.filename, self.chapter_title, url)
+            )
+        with open(file_name, 'w') as file:
+            file.write(response.text)
+
     @staticmethod
     def url_is_absolute(url):
         return bool(urlparse(url).netloc)
@@ -652,17 +662,27 @@ class SafariBooks:
 
         return None
 
-    def parse_html(self, root, first_page=False):
+    def parse_html(self, root, url, first_page=False):
         if random() > 0.8:
             if len(root.xpath("//div[@class='controls']/a/text()")):
                 self.display.exit(self.display.api_error(" "))
 
         book_content = root.xpath("//div[@id='sbo-rt-content']")
         if not len(book_content):
-            self.display.exit(
-                "Parser: book content's corrupted or not present: %s (%s)" %
-                (self.filename, self.chapter_title)
-            )
+            filename = '/tmp/ch.html'
+            self.download_html_to_file(url, filename)
+            parser = etree.HTMLParser()
+            tree = etree.parse(filename, parser)
+            book_content = tree.xpath("//div[@id='sbo-rt-content']")
+            if not len(book_content):
+                self.display.exit(
+                    "Parser: book content's corrupted or not present: %s (%s)" %
+                    (self.filename, self.chapter_title)
+                )
+            # KLUDGE(KNR): When parsing this way the resulting object has type Element
+            # instead of HtmlElement. So perform a crude conversion into the right type.
+            from lxml.html import fromstring, tostring
+            book_content[0] = html.fromstring(tostring(book_content[0]))
 
         page_css = ""
         if len(self.chapter_stylesheets):
@@ -846,7 +867,10 @@ class SafariBooks:
                     self.display.book_ad_info = 2
 
             else:
-                self.save_page_html(self.parse_html(self.get_html(next_chapter["content"]), first_page))
+                chapter_ = next_chapter["content"]
+                html_ = self.get_html(chapter_)
+                parsed_page_ = self.parse_html(html_, chapter_, first_page)
+                self.save_page_html(parsed_page_)
 
             self.display.state(len_books, len_books - len(self.chapters_queue))
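
The /tmp limitation noted before the patch could probably be lifted with Python's standard tempfile module. The following is only a minimal sketch of that idea, not part of the patch: write_temp_html is a hypothetical helper name, and the sample HTML string stands in for the downloaded response.text.

```python
import os
import tempfile


def write_temp_html(text, encoding="utf-8"):
    """Write text to a uniquely named temporary .html file and return its path.

    Hypothetical helper: it would replace the hard-coded '/tmp/ch.html'
    so that the workaround is not tied to *nix-style paths.
    """
    # mkstemp picks the platform's temporary directory and a unique name.
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "w", encoding=encoding) as handle:
        handle.write(text)
    return path


# Stand-in for response.text from the re-downloaded chapter.
sample = "<div id='sbo-rt-content'>chapter</div>"

path = write_temp_html(sample)
try:
    # In the patch, this is where etree.parse(path, etree.HTMLParser())
    # would run; here the file is simply read back.
    with open(path, encoding="utf-8") as handle:
        content = handle.read()
finally:
    # Remove the temporary file once it has been parsed.
    os.remove(path)
```

Writing with an explicit encoding also avoids depending on the platform's default text encoding, which the open(file_name, 'w') call in the patch relies on.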

1 reaction

glasslion commented, Sep 27, 2021

Please upgrade lxml to the latest version.

In my case, lxml<=4.4.2 can’t parse HTML content that contains mathematical Unicode characters (https://stackoverflow.com/questions/69334692/lxml-can-not-parse-html-fragment-contains-certain-unicode-character ).
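
To see whether an installation falls into the range mentioned above, the installed lxml version can be compared against 4.4.2. This is a rough sketch using only the standard library; the helper names are made up for illustration, and the 4.4.2 threshold is taken from the comment above rather than from lxml's own release notes.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption from the comment above: releases up to and including 4.4.2
# are the ones reported to fail on such content.
LAST_AFFECTED = (4, 4, 2)


def parse_release(version_string):
    """Turn '4.4.2' into (4, 4, 2), stopping at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def needs_upgrade(version_string, last_affected=LAST_AFFECTED):
    """Return True if the given version is at or below the affected release."""
    return parse_release(version_string) <= last_affected


def installed_lxml_version():
    """Return the installed lxml version string, or None if lxml is missing."""
    try:
        return version("lxml")
    except PackageNotFoundError:
        return None


current = installed_lxml_version()
if current is None:
    print("lxml is not installed")
elif needs_upgrade(current):
    print(f"lxml {current} may be affected; consider upgrading")
else:
    print(f"lxml {current} is newer than 4.4.2")
```

If the check reports an affected version, pip install --upgrade lxml is the usual way to upgrade.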