Incorrect article text is extracted for multiple articles on some domains.

First off, I would like to thank the creators for making this package free; it is a lifesaver and a timesaver. However, I’d like to raise an issue I’m having with the extractor and, ideally, find a workaround. My conda env has: newspaper3k=0.2.8=py37_0
For my sample article (NYTIMES Sample), the extracted text starts several paragraphs below where the article actually begins. My extracted text begins with:

“In letters to state regulatory boards and in interviews…”

But it should begin with:

For Alyssa Watrous, the medication mix-up meant…

I’ve noticed the same problem with multiple articles on the nytimes website. I’ve just updated my packages, and that did not help. I would appreciate it if anyone could point me to the source of these problems. I know that getting this package to extract every website perfectly may be unattainable, but if there is a way, I would like to look into fixing this myself. Below is my basic setup:

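from newspaper import Article, Config

# url holds the address of the article to extract (the NYTIMES sample above)
config = Config()
article = Article(url, config=config)
article.download()
article.parse()
article.nlp()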
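
Since the problem shows up on several articles, a small check can flag each affected URL. The following is only a sketch: `normalize` and `lead_offset` are hypothetical helper names that are not part of newspaper, and the expected opening has to be supplied by hand for each article.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and strip surrounding quotes so comparisons are forgiving."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.strip("\"'“”‘’ ")


def lead_offset(extracted: str, expected_opening: str) -> int:
    """Locate the expected opening inside the extracted text.

    Returns 0 when the extraction starts at the expected opening,
    a positive character offset when the opening appears later,
    and -1 when the opening is missing altogether.
    """
    return normalize(extracted).find(normalize(expected_opening))
```

With `article.text` as the first argument and the article's real first words as the second, a result of -1 would correspond to the behavior described above, where the lead paragraphs are dropped.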

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
ariel-frischer commented, May 8, 2020

@ashkaushik I have not got the experience or the time to delve into how the node structure works for this package. I would honestly pay someone to fix these issues if they have some expertise on this library. I just wish this project was better maintained, no updates in a while…

0 reactions
kmgreen2 commented, Sep 18, 2020

Here is a quick hack. Again, I just started using this package this morning, so I assume I may be missing something. That said, I’ll likely fork and restructure Article to allow a custom extractor, instead of abusing Python’s ability to mutate class internals. Looks like this has been a problem for a long time, so I assume it would take a long time to get a real fix on master.

This hack ignores the heuristic approach to building the “text subtree” directly from the DOM and just builds a new tree of height 2, where the children are filtered text nodes.

from newspaper.extractors import ContentExtractor
from newspaper import Article
from lxml import etree

class TextContextExtractor(ContentExtractor):
    """Extractor that skips the scoring heuristic and keeps every text-like node."""

    def __init__(self, config):
        ContentExtractor.__init__(self, config)

    def calculate_best_node(self, doc):
        # Candidate nodes are still selected by the base class.
        nodes_to_check = self.nodes_to_check(doc)
        # New tree of height 2: one root, one child per kept text node.
        root = etree.Element("root")

        for node in nodes_to_check:
            text_node = self.parser.getText(node)
            word_stats = self.stopwords_class(language=self.language). \
                get_stopword_count(text_node)
            high_link_density = self.is_highlink_density(node)
            # Keep nodes that read like prose: more than 2 stopwords, few links.
            if word_stats.get_stopword_count() > 2 and not high_link_density:
                text_element = etree.SubElement(root, "foo")
                text_element.text = text_node
        return root

if __name__ == '__main__':
    article = Article('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html')
    article.extractor = TextContextExtractor(article.config)
    article.download()
    article.parse()
    print(article.text)

Hope this helps others 😃
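
To make the filtering idea behind the hack easier to follow without installing newspaper, here is a minimal standalone sketch using only the standard library. Everything in it is an assumption for illustration: `STOPWORDS`, `stopword_count`, `build_flat_tree` and the 0.5 link-ratio cutoff are hypothetical stand-ins, and newspaper's own stopword lists and `is_highlink_density` check are implemented differently.

```python
import xml.etree.ElementTree as ET

# Tiny illustrative stopword set (assumption); newspaper ships per-language lists.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is",
             "was", "for", "that", "it", "on", "with"}


def stopword_count(text):
    """Count words in text that appear in the illustrative stopword set."""
    words = (word.strip(".,;:!?\"'") for word in text.lower().split())
    return sum(1 for word in words if word in STOPWORDS)


def build_flat_tree(candidates):
    """Build a tree of height 2 from (text, link_text) pairs.

    text is the full text of a candidate node and link_text is the part
    of it that sits inside links. A candidate is kept when it has more
    than 2 stopwords and less than half of its characters are link text.
    """
    root = ET.Element("root")
    for text, link_text in candidates:
        if not text:
            continue
        link_ratio = len(link_text) / len(text)
        if stopword_count(text) > 2 and link_ratio < 0.5:
            child = ET.SubElement(root, "p")
            child.text = text
    return root
```

Passing a list of such pairs returns a root element whose children are the prose-like candidates in document order. Navigation bars and short labels tend to fail the stopword test, and link-heavy teasers fail the link-ratio test.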
