Parser: book content's corrupted or not present: 9781098122836


[#] Parser: book content's corrupted or not present: node13-ch5.html (Chapter 5: Top 5 Developer-friendly Node.js API Frameworks)

However, I can open the page in a browser without any problem:

https://learning.oreilly.com/library/view/nodejs-tools/9781098122836/Text/node13-ch5.html

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 10 (1 by maintainers)

Top GitHub Comments

5 reactions
rknuus commented, Feb 4, 2022

A dead ugly workaround is to download the failing file again and then parse it in a slightly different way.

Funnily enough, the object returned by this parser has the type Element rather than HtmlElement, so it must be converted to match the expectations of the code that uses it later on. For this I apply tostring and fromstring conversions, which is certainly not an efficient approach, but my lxml-fu is simply too weak. In my case this code runs rarely enough and is fast enough that I don’t care.

Because the whole thing is so cheesy and I don’t even understand the root cause, I don’t plan to create an MR. The next best thing is to provide the patch below. To apply it, save the patch to a file and run git apply <patch file> in the safaribooks git repo. If the patch fails to apply, consider checking out version af22b43c1 or a sufficiently compatible version and trying again.

Limitation: because I hard-code a path under /tmp, the hack only works on *nix-based systems (incl. Macs). I didn’t bother to use StringIO or at least Python’s temporary file module.

diff --git a/safaribooks.py b/safaribooks.py
index 1d23bee..461e2ef 100755
--- a/safaribooks.py
+++ b/safaribooks.py
@@ -605,6 +605,16 @@ class SafariBooks:
 
         return root
 
+    def download_html_to_file(self, url, file_name):
+        response = self.requests_provider(url)
+        if response == 0 or response.status_code != 200:
+            self.display.exit(
+                "Crawler: error trying to retrieve this page: %s (%s)\n    From: %s" %
+                (self.filename, self.chapter_title, url)
+            )
+        with open(file_name, 'w') as file:
+            file.write(response.text)
+
     @staticmethod
     def url_is_absolute(url):
         return bool(urlparse(url).netloc)
@@ -652,17 +662,27 @@ class SafariBooks:
 
         return None
 
-    def parse_html(self, root, first_page=False):
+    def parse_html(self, root, url, first_page=False):
         if random() > 0.8:
             if len(root.xpath("//div[@class='controls']/a/text()")):
                 self.display.exit(self.display.api_error(" "))
 
         book_content = root.xpath("//div[@id='sbo-rt-content']")
         if not len(book_content):
-            self.display.exit(
-                "Parser: book content's corrupted or not present: %s (%s)" %
-                (self.filename, self.chapter_title)
-            )
+            filename = '/tmp/ch.html'
+            self.download_html_to_file(url, filename)
+            parser = etree.HTMLParser()
+            tree = etree.parse(filename, parser)
+            book_content = tree.xpath("//div[@id='sbo-rt-content']")
+            if not len(book_content):
+                self.display.exit(
+                    "Parser: book content's corrupted or not present: %s (%s)" %
+                    (self.filename, self.chapter_title)
+                )
+            # KLUDGE(KNR): When parsing this way the resulting object has type Element
+            # instead of HtmlElement. So perform a crude conversion into the right type.
+            from lxml.html import fromstring, tostring
+            book_content[0] = fromstring(tostring(book_content[0]))
 
         page_css = ""
         if len(self.chapter_stylesheets):
@@ -846,7 +867,10 @@ class SafariBooks:
                     self.display.book_ad_info = 2
 
             else:
-                self.save_page_html(self.parse_html(self.get_html(next_chapter["content"]), first_page))
+                chapter_ = next_chapter["content"]
+                html_ = self.get_html(chapter_)
+                parsed_page_ = self.parse_html(html_, chapter_, first_page)
+                self.save_page_html(parsed_page_)
 
             self.display.state(len_books, len_books - len(self.chapters_queue))
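
The /tmp limitation mentioned above could probably be lifted with Python's tempfile module. The following is only a minimal sketch, not part of the patch: the helper name temporary_html_file is hypothetical, and it assumes the re-downloaded page is available as a string (response.text in the patch). It writes the text to a uniquely named file in the platform's temp directory and removes the file afterwards.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_html_file(html_text):
    """Yield the path of a temp file holding html_text; delete it on exit.

    Hypothetical helper, not part of safaribooks.
    """
    # delete=False keeps the file on disk after it is closed, so that other
    # code can reopen it by name (this matters on Windows).
    handle = tempfile.NamedTemporaryFile(
        mode="w", suffix=".html", encoding="utf-8", delete=False
    )
    try:
        with handle:
            handle.write(html_text)
        yield handle.name
    finally:
        os.remove(handle.name)


if __name__ == "__main__":
    sample = "<html><body><div id='sbo-rt-content'>demo</div></body></html>"
    with temporary_html_file(sample) as path:
        # In the patch, this path would take the place of '/tmp/ch.html'
        # in the etree.parse(filename, parser) call.
        print(path)
```

The file is written as UTF-8 here, whereas the patch relies on the platform default encoding, so the parser may need to be told about the encoding if the page does not declare one.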

1 reaction
glasslion commented, Sep 27, 2021

Please upgrade lxml to the latest version.

In my case, lxml<=4.4.2 can’t parse HTML content that contains mathematical Unicode characters (https://stackoverflow.com/questions/69334692/lxml-can-not-parse-html-fragment-contains-certain-unicode-character).
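
To see whether an installation falls into the range reported here, the installed version can be compared against 4.4.2. This is a rough sketch: the function names are made up for illustration, and the threshold is simply the one from this comment, not an independently verified boundary.

```python
from importlib import metadata

# Last version reported as failing in the comment above (unverified).
LAST_REPORTED_BAD = (4, 4, 2)


def version_tuple(version):
    """Turn '4.4.2' into (4, 4, 2), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def may_be_affected(version):
    """Return True if version is at or below the reported threshold."""
    return version_tuple(version) <= LAST_REPORTED_BAD


def installed_lxml_version():
    """Return the installed lxml version string, or None if lxml is missing."""
    try:
        return metadata.version("lxml")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_lxml_version()
    if found is None:
        print("lxml is not installed")
    elif may_be_affected(found):
        print("lxml %s may be affected; consider upgrading" % found)
    else:
        print("lxml %s is newer than the reported range" % found)
```

If the check reports an affected version, upgrading with pip install --upgrade lxml is the suggested first step before trying any parser workaround.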
