Incorrect article text is extracted for multiple articles on some domains.
First off, I would like to thank the creators for making this package free; it is a lifesaver and a timesaver. However, I'd like to address the issues I'm having with the extractor and perhaps find a workaround.
My conda env has:
newspaper3k=0.2.8=py37_0
The following sample article only extracts text starting multiple paragraphs below where the article actually begins: NYTIMES Sample. My extracted text begins with:
“In letters to state regulatory boards and in interviews…”
But it should begin with:
“For Alyssa Watrous, the medication mix-up meant…”
I’ve noticed this is the case for multiple articles on the nytimes website. I’ve just updated my packages, and that did not help. I would appreciate it if anyone knows the source of these problems. I know getting this package to extract every website perfectly may be unattainable, but if there is a way, I may look into fixing this myself. Below is my basic setup:
from newspaper import Article, Config

config = Config()
article = Article(url, config=config)  # url is the NYT article linked above
article.download()
article.parse()
article.nlp()
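For debugging, it can also help to dump the subtree the extractor actually selected. This is a small sketch, assuming the top_node / clean_top_node attributes that newspaper3k 0.2.8 sets on Article during parse(); the attribute names may differ in other versions:

import lxml.etree

# clean_top_node is the DOM subtree the scoring heuristic chose as the article body.
# Dumping it shows where extraction really starts (here, the "In letters..." paragraph).
print(lxml.etree.tostring(article.clean_top_node, pretty_print=True).decode()[:1000])
print(article.text[:200])  # compare with the expected opening paragraph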
Top GitHub Comments
@ashkaushik I don’t have the experience or the time to delve into how the node structure works for this package. I would honestly pay someone to fix these issues if they have expertise with this library. I just wish this project were better maintained; there have been no updates in a while…
Here is a quick hack. Again, I just started using this package this morning, so I assume I may be missing something. That said, I’ll likely fork and restructure Article to allow a custom extractor, instead of abusing Python’s ability to mutate class internals. It looks like this has been a problem for a long time, so I assume it would take a long time to get a real fix on master. This hack ignores the heuristic approach of building the “text subtree” directly from the DOM and just builds a new tree of height 2, where the children are filtered text nodes.
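The original snippet is not reproduced here, but below is a minimal sketch of that kind of monkey-patch, assuming the ContentExtractor.calculate_best_node and is_highlink_density names from the newspaper3k 0.2.8 source and a placeholder url; verify both against your installed version before relying on it:

import copy

import lxml.html
from newspaper import Article, Config
from newspaper.extractors import ContentExtractor


def flat_best_node(self, doc):
    # Skip the node-scoring heuristic entirely: build a height-2 tree whose
    # children are every <p> node that has text and is not link-heavy.
    root = lxml.html.Element('div')
    for node in doc.xpath('//p'):
        if not (node.text_content() or '').strip():
            continue  # drop empty paragraphs
        if self.is_highlink_density(node):
            continue  # drop nav / related-links boilerplate
        root.append(copy.deepcopy(node))
    return root


# Mutate the class so every Article instance uses the flat extractor.
ContentExtractor.calculate_best_node = flat_best_node

url = 'https://www.nytimes.com/...'  # placeholder; use the article from the issue
config = Config()
article = Article(url, config=config)
article.download()
article.parse()
print(article.text[:300])

Note that patching the class this way affects every Article in the process, which is why a fork that accepts a custom extractor would be the cleaner long-term fix.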
Hope this helps others 😃