Incorrect article text is extracted for multiple articles on some domains.

First off, I would like to thank the creators for making this package free; it is a lifesaver and a timesaver. However, I’d like to raise an issue I’m having with the extractor and perhaps find a workaround. My conda env has: newspaper3k=0.2.8=py37_0
For my sample article (NYTIMES Sample), the extracted text starts several paragraphs below where the article actually begins. My extracted text begins with:

“In letters to state regulatory boards and in interviews…”

But it should begin with:

For Alyssa Watrous, the medication mix-up meant…

I’ve noticed this is the case for multiple articles on the nytimes website. I’ve just updated my packages, and that did not help. I would appreciate it if anyone knows the source of these problems. I know that fixing this package to extract every website perfectly may be unattainable, but if there is a way, I would like to look into fixing this myself. Below is my basic setup:

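from newspaper import Article, Config

# url is the address of the article to extract (the NYTIMES sample above)
config = Config()
article = Article(url, config=config)
article.download()
article.parse()
article.nlp()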

Top GitHub Comments

ariel-frischer commented, May 8, 2020

@ashkaushik I have not got the experience or the time to delve into how the node structure works for this package. I would honestly pay someone to fix these issues if they have some expertise on this library. I just wish this project was better maintained, no updates in a while…

kmgreen2 commented, Sep 18, 2020

Here is a quick hack. Again, I just started using this package this morning, so I assume I may be missing something. That said, I’ll likely fork and restructure Article to allow a custom extractor, instead of abusing Python’s ability to mutate class internals. Looks like this has been a problem for a long time, so I assume it would take a long time to get a real fix on master.

This hack ignores the heuristic approach to building the “text subtree” directly from the DOM and just builds a new tree of height 2, where the children are filtered text nodes.

from newspaper.extractors import ContentExtractor
from newspaper import Article
from lxml import etree

class TextContextExtractor(ContentExtractor):
    def __init__(self, config):
        ContentExtractor.__init__(self, config)

    def calculate_best_node(self, doc):
        # Skip newspaper's scoring heuristic: collect every candidate node
        # that looks like body text under a new, flat root element.
        nodes_to_check = self.nodes_to_check(doc)
        root = etree.Element("root")

        for node in nodes_to_check:
            text_node = self.parser.getText(node)
            word_stats = self.stopwords_class(
                language=self.language).get_stopword_count(text_node)
            high_link_density = self.is_highlink_density(node)
            # Keep nodes with more than 2 stopwords and no heavy link density.
            if word_stats.get_stopword_count() > 2 and not high_link_density:
                text_element = etree.SubElement(root, "foo")
                text_element.text = text_node
        return root

if __name__ == '__main__':
    article = Article('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html')
    article.extractor = TextContextExtractor(article.config)
    article.download()
    article.parse()
    print(article.text)

Hope this helps others 😃
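
For readers who want to see the filtering idea from the hack above in isolation, here is a minimal sketch that needs neither newspaper nor a network connection. It is not the library's implementation. The stopword set, the sample strings, and the names count_stopwords and build_flat_text_tree are all made up for illustration, it uses the standard library's xml.etree.ElementTree in place of lxml, and it leaves out the link-density check.

```python
import xml.etree.ElementTree as ET

# Hypothetical, tiny stopword set for illustration only; newspaper uses
# its own per-language stopword lists.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for",
             "that", "is", "was", "it", "on", "with"}


def count_stopwords(text):
    """Count whitespace-separated tokens that appear in STOPWORDS."""
    tokens = (token.strip(".,;:!?\"'") for token in text.lower().split())
    return sum(1 for token in tokens if token in STOPWORDS)


def build_flat_text_tree(candidates, min_stopwords=2):
    """Build a tree of height 2: a root whose children hold the kept texts.

    A candidate is kept when it has more than min_stopwords stopwords,
    mirroring the "> 2" threshold in the hack above.
    """
    root = ET.Element("root")
    for text in candidates:
        if count_stopwords(text) > min_stopwords:
            child = ET.SubElement(root, "text")
            child.text = text
    return root


# Invented sample strings standing in for the text of candidate DOM nodes.
candidates = [
    "Subscribe now",
    "For the patient, the mix-up meant a week in the hospital.",
    "Share | Tweet | Email",
    "In letters to the board and in interviews, it was a common story.",
]

tree = build_flat_text_tree(candidates)
kept = [child.text for child in tree]
print("\n\n".join(kept))
```

Short navigation-style strings contain few or no stopwords and are dropped, while full sentences pass the threshold. The trade-off is that every passing node is kept, so on real pages captions or comment text may slip through along with the article body.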
