Article.parse() not parsing entire article body correctly from HTML!
When I download and parse an article (I’ll include one here from CNN), parsing stops at the “Read More” in the HTML and doesn’t return the entire body. It also includes the story highlights as part of the text, which I don’t think it should do.
I’m fairly certain I’m not doing anything wrong, and this happens in both the Python 2 and Python 3 versions.
I’ve also tested it on two different systems: macOS on my laptop, from different locations and IP addresses, and an Ubuntu server on yet another IP. So I’m fairly certain it has nothing to do with the machine or the network.
How to reproduce:
>>> from newspaper import Article
>>> a = Article("http://www.cnn.com/2017/12/06/politics/al-franken-replacement/index.html")
>>> a.download()
>>> a.parse()

At this point, it prints out the following:

>>> print(a.text)
Story highlights
Democratic Gov. Mark Dayton would appoint a replacement if Franken resigns
That would set up a special election in November 2018
(CNN) Should Sen. Al Franken decide to step down, his resignation would set up a gubernatorial appointment and open up a new Senate battleground in 2018.
Minnesota Gov. Mark Dayton does not plan to get ahead of Franken's scheduled announcement Thursday, a senior Minnesota Democrat close to Dayton told CNN, but the governor's "expectation and hope is for Franken to resign."
Should Franken step down, top names to replace him are Democratic Reps. Keith Ellison and Tim Walz, this official said. Another leading contender will be Lt. Gov. Tina Smith, a former chief of staff to Dayton.
"Don't overlook Lt. Governor Smith," the official said. "She could be the perfect choice."
Dayton, a former US senator, might also tap his former colleagues for advice in his pick, including Senate Minority Leader Chuck Schumer.
"There will be an open line of communication," said the senior Democratic strategist.
Read More
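For anyone checking a larger batch of articles for the same two symptoms, a small helper along these lines might be useful. It is only a sketch and not part of newspaper: the name `describe_symptoms` and the two marker strings are assumptions taken from the CNN output above, and other sites will need other markers.

```python
def describe_symptoms(text):
    """Return labels for the two symptoms seen in the CNN output above.

    Hypothetical helper, not part of newspaper. The marker strings are
    assumptions based on this one CNN article.
    """
    symptoms = []
    stripped = text.strip()
    # The "Story highlights" box leaked into the article text.
    if stripped.startswith("Story highlights"):
        symptoms.append("highlights_included")
    # The text ends at the "Read More" marker instead of the real ending.
    if stripped.endswith("Read More"):
        symptoms.append("truncated_at_read_more")
    return symptoms


# Shortened stand-in for the a.text output shown above.
sample = (
    "Story highlights\n"
    "Democratic Gov. Mark Dayton would appoint a replacement if Franken resigns\n"
    "(CNN) Should Sen. Al Franken decide to step down, his resignation would "
    "set up a gubernatorial appointment.\n"
    "Read More"
)
print(describe_symptoms(sample))
```

With the session above, it would be called as `describe_symptoms(a.text)` right after `a.parse()`. It only detects the problem; it does not recover the missing part of the body.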
However, using the demo here:
If you paste in the article:
http://www.cnn.com/2017/12/06/politics/al-franken-replacement/index.html (here’s the activated link: http://newspaper-demo.herokuapp.com/articles/show?url_to_clean=http%3A%2F%2Fwww.cnn.com%2F2017%2F12%2F06%2Fpolitics%2Fal-franken-replacement%2Findex.html )
It parses the article correctly and displays the text I would expect. (I won’t paste it here but you can try it yourself.)
What am I doing wrong? The fact that this is consistent on two systems, across multiple OSes and Python versions and from different IPs, indicates to me that I must either have a broken requirement or be doing something incorrectly.
In : try:
...:     html_string = ElementTree.tostring(article.clean_top_node)
...: except:
...:     html_string = "Error converting html to string."
...:
In : html_string
Out: 'Error converting html to string.'
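The bare `except:` in that snippet hides the reason the conversion failed. A variant that reports the cause could look like the sketch below. It assumes `ElementTree` is the standard library's `xml.etree.ElementTree` (the snippet does not show the import), and `node_to_html` is a made-up name. It has not been checked against the node objects newspaper actually produces.

```python
from xml.etree import ElementTree


def node_to_html(node):
    """Serialize a node to an HTML string, or explain why that failed.

    Hypothetical helper; assumes the standard library ElementTree.
    """
    # A missing node is a different problem from a failed serialization,
    # so report it separately.
    if node is None:
        return "Error: no top node was extracted."
    try:
        # encoding="unicode" returns str instead of bytes.
        return ElementTree.tostring(node, encoding="unicode")
    except Exception as exc:
        # Keep the exception type and message instead of discarding them.
        return "Error converting html to string: {}: {}".format(
            type(exc).__name__, exc
        )


demo_node = ElementTree.fromstring("<div><p>Minnesota Gov. Mark Dayton</p></div>")
print(node_to_html(demo_node))
print(node_to_html(None))
```

In the session above, the call would be `node_to_html(article.clean_top_node)`. The returned message then shows whether the node was missing or whether serialization itself raised.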
I downloaded the demo and ran it locally on my laptop to see what happens differently, and I still get the same result as inside IPython. What’s interesting is that the Article HTML shown at the bottom of the demo page differs considerably between the official Heroku demo and my local copy. Here’s what’s on mine:
b'<div class="l-container" gravityNodes="15" gravityScore="243.5"><div class="el__leafmedia el__leafmedia--storyhighlights"><div class="el__storyhighlights_wrapper"><div class="el__storyhighlights"><h3 class="el__headline">Story highlights</h3><ul class="el__storyhighlights__list"><li class="el__storyhighlights__item el__storyhighlights--normal">Democratic Gov. Mark Dayton
and on the official demo:
<div gravityNodes="15" gravityScore="240"><p class="zn-body__paragraph speakable">Minnesota Gov. Mark Dayton
So it almost looks like an inconsistency in how the HTML is being parsed. Could that be a tooling issue or a library issue of some kind? I just reverified on OS X that I have all the proper libraries installed and updated, and I uninstalled and reinstalled newspaper3k, but I still get the same result.
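One way to narrow down a library difference is to print the installed versions on each machine and compare them. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+). The package list is an assumption: `newspaper3k` comes from this report, while `lxml` and `beautifulsoup4` are listed as likely parsing dependencies and should be adjusted to match the actual requirements file.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment.
            versions[name] = None
    return versions


# Assumed list of packages worth comparing between environments.
packages_to_check = ["newspaper3k", "lxml", "beautifulsoup4"]
for name, version in installed_versions(packages_to_check).items():
    print(name, version or "not installed")
```

Running this on the laptop and on the Ubuntu server shows whether the two environments have the same parser versions installed.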
I really need to get this fixed as I’m trying to build a dataset for machine learning that I just realized might not be working properly. Help would be amazing.
- Created 5 years ago
Top GitHub Comments
This issue is still present in the Python 3 version of this library. I have reproduced it on the CNN website.
Do we know if there is a fix for this in Python3? I’m having the same issue when using newspaper3k.