Article.parse() not parsing entire article body correctly from HTML!

Overview

When I download and parse an article (the example here is from CNN), parsing stops at the “Read More” marker in the HTML and does not return the entire body. It also includes the story highlights as part of the text, which I don’t think it should.

I’m fairly certain I’m not doing anything wrong, and this happens in both the Python 2 and Python 3 versions.

I’ve also tested this on two different systems, macOS on my laptop and an Ubuntu server, from different locations and IP addresses, so I’m fairly certain the environment is not the cause.

How to reproduce:

>>> from newspaper import Article

>>> a = Article("http://www.cnn.com/2017/12/06/politics/al-franken-replacement/index.html")
>>> a.download()
>>> a.parse()

At this point, it prints out the following:

>>> print(a.text)

Story highlights Democratic Gov. Mark Dayton would appoint a replacement if Franken resigns

That would set up a special election in November 2018

(CNN) Should Sen. Al Franken decide to step down, his resignation would set up a gubernatorial appointment and open up a new Senate battleground in 2018.

Minnesota Gov. Mark Dayton does not plan to get ahead of Franken's scheduled announcement Thursday, a senior Minnesota Democrat close to Dayton told CNN, but the governor's "expectation and hope is for Franken to resign."

Should Franken step down, top names to replace him are Democratic Reps. Keith Ellison and Tim Walz, this official said. Another leading contender will be Lt. Gov. Tina Smith, a former chief of staff to Dayton.

"Don't overlook Lt. Governor Smith," the official said. "She could be the perfect choice."

Dayton, a former US senator, might also tap his former colleagues for advice in his pick, including Senate Minority Leader Chuck Schumer. "There will be an open line of communication," said the senior Democratic strategist.

Read More

However, the demo here behaves differently:

http://newspaper-demo.herokuapp.com/

If you paste in the article URL:

http://www.cnn.com/2017/12/06/politics/al-franken-replacement/index.html (direct link to the demo result: http://newspaper-demo.herokuapp.com/articles/show?url_to_clean=http%3A%2F%2Fwww.cnn.com%2F2017%2F12%2F06%2Fpolitics%2Fal-franken-replacement%2Findex.html )

the demo parses the article correctly and displays the text I would expect. (I won’t paste it here, but you can try it yourself.)

Resolution

What am I doing wrong? The fact that this is consistent on two systems, across multiple operating systems and Python versions and from different IPs, suggests that either I have a broken dependency or I’m doing something incorrectly.

In [9]: try:
   ...:     html_string = ElementTree.tostring(article.clean_top_node)
   ...: except:
   ...:     html_string = "Error converting html to string."
   ...:

In [10]: html_string
Out[10]: 'Error converting html to string.'
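
The bare `except:` in the snippet above discards the reason the conversion failed, which makes this output hard to diagnose. Below is a minimal sketch of a variant that keeps the exception type and message. It assumes the standard library's `xml.etree.ElementTree`; the issue does not confirm which `ElementTree` module the demo imports, and `to_html_string` is a hypothetical helper name, not part of newspaper.

```python
from xml.etree import ElementTree


def to_html_string(node):
    """Serialize an element, or return a message naming the failure.

    Hypothetical helper for illustration; not part of newspaper.
    """
    try:
        return ElementTree.tostring(node, encoding="unicode")
    except Exception as exc:  # keep the cause instead of hiding it
        return "Error converting html to string: {}: {}".format(
            type(exc).__name__, exc
        )


# A well-formed element serializes normally.
paragraph = ElementTree.fromstring("<p>Minnesota Gov. Mark Dayton</p>")
print(to_html_string(paragraph))

# Passing something that is not an element reports why it failed.
print(to_html_string(None))
```

In the session above, the argument would be `article.clean_top_node`, and the reported exception type would show whether the node is missing or simply of a type the serializer does not accept.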

EDIT

I downloaded and ran the demo locally on my laptop to see what it does differently, and I’m still getting the same result as when I run the code inside IPython. What’s interesting is that the Article HTML shown at the bottom of the demo page differs considerably between the official Heroku demo and my local copy. Here’s what mine shows:

b'<div class="l-container" gravityNodes="15" gravityScore="243.5"><div class="el__leafmedia el__leafmedia--storyhighlights"><div class="el__storyhighlights_wrapper"><div class="el__storyhighlights"><h3 class="el__headline">Story highlights</h3><ul class="el__storyhighlights__list"><li class="el__storyhighlights__item el__storyhighlights--normal">Democratic Gov. Mark Dayton

and on the official demo:

<div gravityNodes="15" gravityScore="240"><p class="zn-body__paragraph speakable">Minnesota Gov. Mark Dayton

So it looks like an inconsistency in how the HTML is being parsed. Could that be a tooling or library issue of some kind? I just reverified on OS X that I have all the proper libraries installed and updated, and I uninstalled and reinstalled newspaper3k, with the same result.

I really need to get this fixed, as I’m building a dataset for machine learning and just realized that it might not be built correctly. Help would be amazing.

Thanks! /h
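
For anyone checking an existing dataset for this problem, both symptoms described above can be detected from the extracted text alone. The following is a minimal sketch, assuming affected articles start with "Story highlights" or end with a trailing "Read More" line, as in the CNN output above. `find_symptoms` is a hypothetical helper, not part of newspaper, and the markers are specific to this example.

```python
def find_symptoms(text):
    """Return a list of symptoms seen in extracted article text.

    Hypothetical helper; the markers come from the CNN output in this
    issue and may not apply to other sites.
    """
    # Ignore blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    symptoms = []
    if lines and lines[0].startswith("Story highlights"):
        symptoms.append("starts with highlights")
    if lines and lines[-1] == "Read More":
        symptoms.append("ends at Read More")
    return symptoms


sample = (
    "Story highlights Democratic Gov. Mark Dayton would appoint "
    "a replacement if Franken resigns\n\n"
    "That would set up a special election in November 2018\n\n"
    "Read More"
)
print(find_symptoms(sample))
```

With newspaper, the argument would be `a.text` after calling `a.parse()`. This only flags suspect records; it does not recover the missing body text.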

Top GitHub Comments

bilaltahirz commented, May 24, 2020

This issue is still present in the Python 3 version of this library. I have reproduced it on the CNN website.

kshitijsachan commented, Apr 20, 2020

Do we know if there is a fix for this in Python 3? I’m having the same issue when using newspaper3k.
